Introduction

The robust identification of material culture variability remains an important goal in Palaeolithic archaeology, given that such variability is commonly coupled with interpretations of culture change, demography and adaptation. Material culture variability as seen in tool forms and assemblage structure is often readily agglomerated into higher-order units labelled ‘cultures’, ‘techno-complexes’, ‘groups’, ‘facies’, ‘industries’, ‘techno-units’ or the like (e.g. Gamble et al., 2005). For instance, in the Final Palaeolithic of Europe and the Paleoindian period of North America — what Williams and Madsen (2019) collectively refer to as the ‘Late Upper Palaeolithic’ — most such archaeological units are diagnosed, characterised and interpreted on the basis of artefact typology, chiefly that of lithic projectile points. Yet, critiques of such type-centred approaches abound (Bisson, 2000; Monnier & Missal, 2014), and recent applications of cultural evolutionary reasoning combined with computer-aided methods such as geometric morphometrics (GMM) have raised doubts as to the validity of many such entrenched groupings (Buchanan et al., 2020; Ivanovaitė et al., 2020; MacLeod, 2018; Serwatka & Riede, 2016). At the core of this emergent critique is the argument that traditional, essentialist typologies often fail to account for the internal variation seen in most qualitatively defined Palaeolithic artefact classes (cf. Shea, 2014; Wilkins, 2020) and that these typologies can easily reify preconceived differences (cf. O’Brien & Lyman, 2002). It has further been argued that the purported ‘cultural’ differences perpetuated through many traditional artefact typologies are more often than not the products of contingent disciplinary histories, rather than reflecting robust empirical realities (cf. Clark and Riel-Salvatore, 2006; Reynolds & Riede, 2019).

The analysis of artefact shapes through geometric morphometrics and cognate methods is by no means restricted to the Upper Palaeolithic, although the latter’s cultural taxonomic crisis may be particularly acute (see Reynolds & Riede, 2019). Acheulian handaxes (e.g. Costa, 2010; McNabb et al., 2018), the Middle Palaeolithic of Europe (e.g. Chacón et al., 2016; Serwatka, 2014) and elsewhere (Lycett, 2007), Holocene forager tools in South America (e.g. Charlin & González-José, 2012; Okumura & Araujo, 2014; Suárez & Cardillo, 2019) and later prehistoric metal axes (Wilczek et al., 2015), lances (Birch & Martinón-Torres, 2019) and adornments (Vestergaard & Hoggard, 2019) have all been subjected to geometric morphometric assessments, often leading to a substantial critique of established typologies. There is little methodological consistency across these emerging case studies, however, and such studies have rarely re-analysed different datasets comparatively (see Monnier & McNulty, 2010, for an exception), making it difficult to evaluate their conclusions independently of their methodological choices. With the goal of better understanding the cultural evolution and adaptation of Final Palaeolithic (ca. 15,000–11,000 cal BP) hunter-gatherers in Europe (Riede et al., 2020) and in an effort to move towards a reliable, case-transferrable analytical toolkit for doing so, we here present a multi-step comparative exploration of data capture, outline-based geometric morphometrics and clustering techniques. In the following, we re-analyse a series of published case studies using their lithic projectile point outline data in order to first develop and then validate our analytical protocol. This validation approach has four steps:

  1. We benchmark whole-outline geometric morphometrics against a previously published landmark-based study aimed at discriminating two archaeological taxa in North America that were previously defined in terms of specific lithic point types (Buchanan et al., 2020).

  2. We then similarly compare landmark-based geometric morphometrics and subsequent clustering as applied to a sample of Czech Bell Beaker projectile points (Petřík et al., 2018) with our whole-outline approach followed by hierarchical clustering. This comparison leads us to critically assess the choices of clustering algorithms.

  3. Thereafter, we apply our protocol to a large sample of Late Neolithic/Early Bronze Age arrowheads from Northwestern Europe with well-constrained chronology (Nicolas, 2016). We show that our clustering protocol successfully replicates traditional techno-typological groupings and that the archaeological typo-chronology for this period, especially in North-West France, can be retrieved in a phenogram inferred from outline data using the neighbour joining (NJ) algorithm.

  4. With this protocol in place, we finally turn to a dataset of Final Palaeolithic/earliest Mesolithic large tanged points from Europe (Ivanovaitė et al., 2020; Riede et al., 2019) in order to test groupings based on typological assessments, and to interrogate this sample in terms of the involved cultural evolutionary processes. We then compare the design space of the tanged points with the three previously re-examined lithic datasets by means of a combined principal component analysis (PCA). Doing so, we focus especially on the internal sum of variance which opens up new perspectives on rates of evolution and patterns of diversity in material culture in different time periods and cultural contexts.

The principal aim of our study is to develop a methodologically transparent protocol for the analysis of whole artefact outlines using GMM methods, which can be used to establish and compare the shape variability of lithic artefacts through time and space. This approach facilitates the search for and the characterisation of structured variation within a given sample of artefacts, thereby enabling the exploration of the cultural transmission dynamics responsible for such variation.

Landmark and Outline Approaches in Geometric Morphometrics

Geometric morphometric approaches can be divided into methods that use landmarks and those that use trigonometric descriptions of whole outlines (Reyment, 1985; Temple, 1992), although the introduction of semi-landmarks and the dense placement of landmarks have led the two approaches to converge (MacLeod, 1999). Landmark approaches offer a straightforward way of delineating homologous structures, but their application also results in the loss of detailed shape information. In addition, the a priori identification of homologous landmarks on artefacts can be difficult and inherently subjective unless unambiguous theoretical expectations are available (Lycett, 2009). Therefore, outline approaches offer an alternative, robust and information-rich way of systematically capturing artefact shape data. Accurate artefact outlines can also be extracted efficiently from widely available legacy data, especially from artefact line drawings that populate the archaeological expert literature and excavation reports.

In archaeology, geometric morphometric approaches in general constitute a major advance vis-à-vis earlier qualitative descriptions, typological assessment or linear measurements of artefact morphologies (e.g. Serwatka & Riede, 2016; Shott & Trail, 2010). The bulk of archaeological applications of GMM have so far relied on landmark-based approaches (Cardillo, 2010; Okumura & Araujo, 2018), although a surge of recent studies is demonstrating the utility of whole-outline approaches using so-called elliptical Fourier analysis (EFA; Kuhl and Giardina, 1982) and cognate approaches (Fox, 2015; Lipo et al., 2016; Wang & Marwick, 2020). Many of these studies also address shape change trajectories and discuss classification issues in Palaeolithic stone tools (e.g. Iovita, 2009; Leplongeon et al., 2020; MacLeod, 2018; Mesfin et al., 2020).

The application of phylogenetics to then interpret such classification in terms of cultural evolutionary dynamics is now well-established (e.g. Lipo et al., 2006; Lycett, 2010; Mace et al., 2005; Mace & Pagel, 1997; O’Brien et al., 2008). It remains contested, however, whether object outlines can capture phylogenetic signals and to what extent they convey social information. In evolutionary biology, Zelditch et al. (1995), for example, have claimed that while landmarks can serve as valid characters for phylogenetic reconstructions, outlines normally fail to do so. Still others suggest that especially in cases where the unambiguous placement of homologous landmarks is difficult, outlines can indeed record dynamics of evolutionary change (e.g. González-José et al., 2008; Polly, 2001; Smith & Hendricks, 2013). This being said, in Palaeolithic archaeology, shape is routinely mobilised as a significant trait for inferring cultural evolutionary processes such as change over time and group formation. Whole-outline morphometrics and landmark-based approaches have rarely been compared explicitly, however, so the relative performance of the two approaches remains poorly understood.

To address this key methodological issue, we here compare the results of prior landmark-based analyses of stone artefacts with the outcomes of geometric morphometric outline approaches. This benchmarking exercise begins with the re-analysis of Buchanan et al.’s (2020) study that used landmark data to resolve the question of a potential taxonomic division within Paleoindian projectile points. We then turn to the study of Petřík et al. (2018) who employed landmark-based GMM and hierarchical cluster analysis to investigate shape variation in Czech Bell Beaker points. Together, the methodological comparison of these two case studies facilitates the exploration of whole-outline GMM and its ability to address nested hierarchical classification. Building on these insights, we then apply our newly developed protocol to our third case, a substantial dataset of bifacially worked arrowheads from North-West France, Southern England and Denmark, dating to between 4500 and 3700 cal BP (Nicolas, 2016). These arrowheads have previously only been examined from a traditional techno-typological perspective, but their regionality and chronology are well-established. This allows us to interpret our hierarchical classification methods in the light of cultural transmission dynamics. Equipped with this novel workflow, we finally turn to a sample of Final Palaeolithic large tanged points from central and eastern Europe in an effort to re-address their traditional classification and understand the cultural evolutionary dynamics within the design space framed by the larger assemblage of projectile points analysed here.

Outline Extraction and Transformation

For the extraction of outlines of multiple artefacts from either photographs or drawings, we developed an efficient workflow in the programming language R (R Core Team, 2020) based on the packages EBImage (Pau et al., 2010), imager (Barthelme et al., 2020) and Momocs (Bonhomme et al., 2014). If the original images meet the required quality standards and contain no information other than the artefacts themselves (i.e. no artefact labels or scales), this workflow enables the fully automatic extraction of their outlines (Matzig, 2021). If the original images are heterogeneous (poor image quality, low contrast between artefact and background, shadows on photographs, etc.), they have to be subjected to minor manual preparation such as colour correction and manual thresholding/binarisation. The workflow is especially efficient for the extraction of individual shape information from images containing more than a single artefact, as is the case for most archaeological catalogues and plate-rich publications. Each artefact in such a (binary) image is extracted using the EBImage::bwlabel() function and then automatically assigned a running number in addition to the identifier of the original table. Each extracted artefact is subsequently saved as a JPEG file while preserving the object’s original size and position. If deemed relevant, this allows the artefacts’ original measurements to be retrieved using a scaling factor from pixels to centimetres. With each individual artefact binarised, extracted and archived, the images are ready for further processing in Momocs, where the function Momocs::import_jpg1() generates outline coordinates for each artefact. The workflow described here can easily be reproduced using the wrapper package outlineR (Matzig, 2021).
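The labelling step of this workflow can be illustrated with a minimal sketch. The code below is written in Python rather than R and is not the EBImage implementation; it assumes a binarised image stored as nested lists of 0/1 values and uses 4-connectivity, which is an assumption made here for illustration only. The function names label_components and bounding_boxes are hypothetical.

```python
from collections import deque


def label_components(binary):
    """Assign a running number to every 4-connected foreground region.

    `binary` is a list of rows containing 0 (background) or 1 (artefact).
    Returns the label image and the number of objects found.
    """
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    current = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] and not labels[r][c]:
                current += 1                      # new artefact found
                labels[r][c] = current
                queue = deque([(r, c)])
                while queue:                      # breadth-first flood fill
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = current
                            queue.append((ny, nx))
    return labels, current


def bounding_boxes(labels):
    """Return {label: (min_row, min_col, max_row, max_col)} for each object."""
    boxes = {}
    for r, row in enumerate(labels):
        for c, value in enumerate(row):
            if value:
                box = boxes.setdefault(value, [r, c, r, c])
                box[0], box[1] = min(box[0], r), min(box[1], c)
                box[2], box[3] = max(box[2], r), max(box[3], c)
    return {key: tuple(box) for key, box in boxes.items()}


# Toy plate with two separate "artefacts".
plate = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
label_image, n_objects = label_components(plate)
boxes = bounding_boxes(label_image)
```

Keeping the bounding box in the coordinate system of the original plate mirrors the idea of preserving each object's size and position, so that a pixel-to-centimetre scaling factor can still be applied later.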

All outline datasets examined in this study were subjected to the same basic outline processing steps in R, also using the Momocs package. The prepared outlines were first centred and scaled. Then, EFA was applied. In EFA, the shape contour is described by so-called harmonics, i.e. sets of sine and cosine terms of successive frequencies; here, we use the number of harmonics required to capture 99.9% of the cumulative harmonic power. The resulting data were then subjected to a PCA and used for further analyses, as described in detail below.
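To make the notion of harmonics and harmonic power concrete, the following Python sketch computes elliptic Fourier coefficients for a closed polygon following the formulation of Kuhl and Giardina (1982) and then counts the harmonics needed to reach a given share of cumulative power. It is an illustration only and not the Momocs code: the function names are hypothetical, the power of a harmonic is taken as half the sum of its squared coefficients, all harmonics including the first are counted, and consecutive outline points are assumed to be distinct.

```python
import math


def efa_coefficients(points, n_harmonics):
    """Elliptic Fourier coefficients (a, b, c, d) per harmonic of a closed outline."""
    n = len(points)
    # Edge vectors, closing the polygon back to its first point.
    dx = [points[(i + 1) % n][0] - points[i][0] for i in range(n)]
    dy = [points[(i + 1) % n][1] - points[i][1] for i in range(n)]
    dt = [math.hypot(u, v) for u, v in zip(dx, dy)]   # edge lengths (must be > 0)
    t = [0.0]
    for step in dt:
        t.append(t[-1] + step)                         # cumulative arc length
    perimeter = t[-1]
    coeffs = []
    for h in range(1, n_harmonics + 1):
        k = 2.0 * math.pi * h / perimeter
        scale = perimeter / (2.0 * h * h * math.pi ** 2)
        a = b = c = d = 0.0
        for i in range(n):
            dcos = math.cos(k * t[i + 1]) - math.cos(k * t[i])
            dsin = math.sin(k * t[i + 1]) - math.sin(k * t[i])
            a += dx[i] / dt[i] * dcos
            b += dx[i] / dt[i] * dsin
            c += dy[i] / dt[i] * dcos
            d += dy[i] / dt[i] * dsin
        coeffs.append((scale * a, scale * b, scale * c, scale * d))
    return coeffs


def harmonics_for_power(coeffs, threshold=0.999):
    """Smallest number of harmonics whose cumulative power reaches `threshold`."""
    power = [(a * a + b * b + c * c + d * d) / 2.0 for a, b, c, d in coeffs]
    total = sum(power)
    running = 0.0
    for h, p in enumerate(power, start=1):
        running += p
        if running / total >= threshold:
            return h
    return len(power)


# A densely sampled circle of radius 2 is described by a single harmonic,
# whereas a triangle with sharp corners needs several.
circle = [(2 * math.cos(2 * math.pi * i / 360), 2 * math.sin(2 * math.pi * i / 360))
          for i in range(360)]
triangle = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.732)]
circle_coeffs = efa_coefficients(circle, 10)
n_circle = harmonics_for_power(circle_coeffs)
n_triangle = harmonics_for_power(efa_coefficients(triangle, 20))
```

Whether the first harmonic is included in the power calculation is a design choice; because it usually dominates, excluding it leads to a larger number of retained harmonics for the same threshold.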

Protocol Development: Materials and Analytical Methods

This section introduces the three successive case studies from which outline data were extracted in order to replicate the original landmark-based or typological studies. We briefly present the context of each study, the main body of lithic data and our chosen approach, including the methods used to reproduce the original findings, to further explore the shape data, and to test, discuss and potentially expand upon previous findings and conclusions.

North American Late Paleoindian Bifacial Points

Using landmark-based GMM, Buchanan et al. (2020) addressed the relationship between two Late Paleoindian (ca. 12,700–11,400 cal BP) projectile point types from the Central United States: the so-called Goshen and Plainview variants (Fig. 1, 5–8). Due to an ambiguous chronology and the overall similarity of these two point types, it continues to be debated whether the two groups of artefacts should be kept separate or whether they should instead be consolidated into a single artefact group, a paradigmatic example of the splitter/lumper controversy (cf. Simpson, 1945). Applying landmark-based geometric morphometrics and statistical testing to a sample of 17 Goshen and 139 Plainview type points, Buchanan et al. (2020) concluded that the two groups are difficult to separate and that Goshen should probably be incorporated into the Plainview group.

We tested whether the key results of Buchanan et al. (2020) can be reproduced using the outlines of the target artefacts, extracted from the original object photographs, which were kindly provided by the authors of the original study. All outlines were processed and analysed as described above. The retrieved outline data, together with their assigned types, were then subjected to a permutational MANOVA, also known as PERMANOVA or nonparametric MANOVA, using the vegan::adonis() function (Dixon, 2003). Pairwise distances within the function were calculated using Euclidean distance.
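The logic of a permutational MANOVA on Euclidean distances can be sketched in a few lines. The Python code below is a simplified, one-factor illustration and not the vegan implementation: it computes a pseudo-F ratio from squared pairwise distances and obtains a p-value by randomly reshuffling the group labels. The function names, the default of 999 permutations and the fixed random seed are assumptions made for this example.

```python
import random


def squared_distances(data):
    """Matrix of squared Euclidean distances between all observations."""
    return [[sum((a - b) ** 2 for a, b in zip(u, v)) for v in data] for u in data]


def pseudo_f(d2, groups):
    """Pseudo-F ratio for a one-factor design, from squared distances."""
    n = len(groups)
    levels = sorted(set(groups))
    ss_total = sum(d2[i][j] for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for level in levels:
        idx = [i for i in range(n) if groups[i] == level]
        pairs = sum(d2[i][j] for pos, i in enumerate(idx) for j in idx[pos + 1:])
        ss_within += pairs / len(idx)
    ss_among = ss_total - ss_within
    n_levels = len(levels)
    return (ss_among / (n_levels - 1)) / (ss_within / (n - n_levels))


def permanova(data, groups, n_perm=999, seed=1):
    """Return the observed pseudo-F and its permutation p-value."""
    rng = random.Random(seed)
    d2 = squared_distances(data)
    f_obs = pseudo_f(d2, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)                 # break the link between shape and type
        if pseudo_f(d2, shuffled) >= f_obs:
            hits += 1
    return f_obs, (hits + 1) / (n_perm + 1)


# Toy example with two clearly separated groups of one-dimensional scores.
scores = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5],
          [10.0], [10.1], [10.2], [10.3], [10.4], [10.5]]
types = ["A"] * 6 + ["B"] * 6
f_value, p_value = permanova(scores, types)
```

Because the null distribution is built from the data themselves, no assumption of multivariate normality is required, which is the main reason for preferring this test over parametric alternatives in the present context.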

We further conducted a hold-one-out cross-validated linear discriminant analysis (LDA) using the MASS::lda() function (Venables & Ripley, 2002) inside the caret environment (Kuhn et al., 2020) with the same data as for the PERMANOVA. The method’s appropriateness for GMM analyses has been debated in the literature (Cooke & Terhune, 2015), but using the principal component (PC) scores for the calculation of the LDA heeds the assumption of a homogeneous covariance structure among groups. For the LDA, we first split the data into a training and a test dataset at a ratio of 75:25. Because Buchanan et al.’s (2020) sample sizes for the two groups are imbalanced, the training and test datasets were created using a stratified sampling approach with the splitstackshape package (Mahto, 2019) in order to ensure that the type ratios were retained in the two subsamples. The training dataset was then used to fit the cross-validated LDA model, while the test dataset was used to assess the performance of that model. Furthermore, we subjected the complete dataset to a measure of relative shape distances. Proposed by Klingenberg and Monteiro (2005, 685), this technique provides ‘a scalar measure of the relative extent of shape differences, while taking into account that variation may not be isotropic’. Closely related to the one-sample version of the Mahalanobis distance, this measure of asymmetry is designed to detect subtle shape differences that cannot be readily isolated in, for instance, PCA plots (cf. Klingenberg & Monteiro, 2005). Moreover, because this distance measure is scalar, it is amenable to univariate statistical analyses (Klingenberg & Monteiro, 2005). Therefore, we transformed our data according to the workflow described by Klingenberg and Monteiro (2005) to arrive at a measure of relative shape distances and assessed these via the Mann–Whitney U test.
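The stratified 75:25 split can be illustrated as follows. This Python sketch is not the splitstackshape routine; the function name stratified_split, the rounding rule for the per-type training size and the fixed seed are assumptions made for the example. Only the group sizes (17 Goshen and 139 Plainview points) are taken from the text.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.75, seed=42):
    """Split observation indices into training and test sets per type.

    Each type contributes round(train_fraction * n_type) observations to the
    training set, so that type ratios are approximately retained.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for index, label in enumerate(labels):
        by_type[label].append(index)
    train, test = [], []
    for label in sorted(by_type):
        members = by_type[label]
        rng.shuffle(members)                       # random draw within the type
        n_train = round(train_fraction * len(members))
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return sorted(train), sorted(test)


# Group sizes as reported for the Goshen/Plainview sample.
point_types = ["Goshen"] * 17 + ["Plainview"] * 139
train_idx, test_idx = stratified_split(point_types)
train_counts = {t: sum(point_types[i] == t for i in train_idx)
                for t in ("Goshen", "Plainview")}
```

With a simple random split, the small Goshen group could by chance be almost absent from the test set; drawing within each type avoids this.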

Czech Bell Beaker Arrowheads

In order to assess whether our protocol is able not only to discriminate between two typologically pre-classified groups but also to derive new, meaningful clusters of artefacts, we turned to the dataset of Czech Bell Beaker projectile points examined by Petřík et al. (2018; cf. Fig. 1, 11–14), who investigated the influence of spatial distribution, raw material availability and curation intensity on shape and size variation among these lithic arrowheads. In a dataset of 194 projectile points from Moravia, Petřík et al. (2018) discriminated between several clusters of arrowhead shapes using landmark-based GMM and hierarchical clustering methods. Our aim, again, was to evaluate whether and to what extent our outline-based GMM protocol is capable of replicating those clusters.

The authors kindly provided us with the drawings of the respective stone artefacts, which were then processed following the workflow described above. Incomplete artefacts were removed from the sample, which reduced our outline dataset to 189 lithic artefacts. The first five PCs, describing 95% of the total shape variation, were retained. Petřík et al. (2018) similarly conducted a PCA on their landmark data and retained the first five PCs, which notably only explained 75% of the observed variability. Similar to Petřík et al. (2018), we then clustered the artefacts into groups based on a pairwise Euclidean distance matrix derived from the outline data’s PC scores, employing the unweighted pair group method with arithmetic mean (UPGMA) algorithm (Sokal & Michener, 1958) in the base-R function hclust(). Following Petřík et al. (2018), we did not remove any outliers and cut the resulting tree in such a way as to also arrive at ten clusters. We then calculated the mean shape of the artefacts contained in each cluster for visualisation and direct comparison with the mean shapes retrieved in the original study.
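The clustering step can be sketched as follows. This Python example is a naive average-linkage (UPGMA) implementation for illustration, not the hclust() routine: it merges the two clusters with the smallest mean pairwise Euclidean distance until the requested number of clusters remains, which corresponds to cutting the tree at that number of groups, and then averages the scores within each cluster. The function names and the toy scores are hypothetical.

```python
import math


def upgma_clusters(points, k):
    """Agglomerate observations by average linkage until k clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def average_distance(first, second):
        # UPGMA: unweighted mean of all between-cluster pairwise distances.
        total = sum(math.dist(points[i], points[j]) for i in first for j in second)
        return total / (len(first) * len(second))

    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return clusters


def cluster_means(points, clusters):
    """Mean score vector of each cluster (a stand-in for the mean shape)."""
    means = []
    for members in clusters:
        columns = zip(*(points[i] for i in members))
        means.append([sum(col) / len(members) for col in columns])
    return means


# Toy PC scores: two tight pairs and one outlier.
pc_scores = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0], [50.0, 50.0]]
groups = upgma_clusters(pc_scores, 3)
means = cluster_means(pc_scores, groups)
```

In the toy data the isolated observation forms a cluster of its own, which illustrates how retaining outliers can consume clusters when the number of groups is fixed in advance.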

The application of (phenetic) clustering methods, including the UPGMA algorithm, has a long history in archaeology (e.g. Clarke, 1968), not least in the Final Palaeolithic (e.g. Burdukiewicz, 1986; Schild et al., 1999). The number of clustering algorithms available is large, however, and explicit motivations or theoretical rationales for favouring one algorithm over another are rarely presented. Petřík et al.’s (2018) study is no exception here. Although hierarchical clustering as an unsupervised classification method generally suits the aims of this paper in so far as it represents a robust tool for an exploratory visualisation of group memberships on different hierarchical levels, we follow Nakoinz (2013) in arguing that the choice of algorithm should always be motivated conceptually and/or theoretically. Our analytical aims align with ‘population thinking’ rather than essentialist typological preoccupations, rendering some clustering algorithms more appropriate than others (Table 1). With reference to now classic debates on essentialism and typological thinking on the one hand and materialism and population thinking on the other (Hull, 1965; Mayr, 1959; Sober, 1980), the pioneers of evolutionary archaeology have long argued against classification approaches that assume pre-existing mental templates or ideal forms (see, for instance, O’Brien & Lyman, 2002). More contemporary writing in the philosophy of biology is critically revisiting and correcting the original debate (Witteveen, 2015, 2018), however, and archaeological rhetoric should likely follow on towards a more nuanced epistemological understanding of variation and its generative processes (Tostevin, 2019).

Table 1 Selected phenetic clustering methods and their implications with regard to cultural evolutionary applications

Northwestern European Late Neolithic/Early Bronze Age Arrowheads

In order to extend our protocol to hierarchical relationships between objects and the clusters they form, we turned towards a sample of Northwestern European (i.e. North-West France, Southern England and Denmark) bifacially worked arrowheads published and examined by Nicolas (2016; cf. Fig. 1, 9–10), dating to the period of 4500–3700 cal BP. The original study provides detailed projectile point images whose geographic provenance and chronological position are sufficiently constrained, as well as a techno-typological classification of these arrowheads projected onto the period’s chronology, which is based on metric measurements and morphological observations. To investigate whether the coupled application of whole-outline GMM and tree-building methods can retrieve the cultural–historical sequence proposed by Nicolas (2016), we first extracted the outlines of 738 lithic arrowheads following our data capture protocol. Before we applied hierarchical clustering (Ward’s method) to the pairwise distance matrix (Euclidean distance) from all PC scores of the retrieved outline shapes, extreme outliers were removed (Online Resource Figs. 8 and 9). We determined the number of clusters k on the basis of the average silhouette value (Rousseeuw, 1987) calculated using the NbClust package (Charrad et al., 2014). For each observation, the silhouette coefficient compares the mean distance to the other members of its own cluster with the mean distance to the members of the nearest neighbouring cluster. The coefficient can thus be calculated for each observation in the whole dataset and averaged for each possible number of clusters. The highest average silhouette value indicates the optimal number of clusters k.
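The silhouette criterion can be written out explicitly. The following Python sketch is not the NbClust code; it computes the average silhouette value for a given partition using Euclidean distances and then selects the candidate number of clusters with the highest average. The function names are hypothetical, singleton clusters are assigned a silhouette of zero by convention, and at least two clusters are assumed.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette value of a partition (requires >= 2 clusters)."""
    n = len(points)
    total = 0.0
    for i in range(n):
        sums, counts = {}, {}
        for j in range(n):
            if j == i:
                continue
            d = math.dist(points[i], points[j])
            sums[labels[j]] = sums.get(labels[j], 0.0) + d
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:
            continue                  # singleton cluster: silhouette set to 0
        a = sums[own] / counts[own]   # mean distance within own cluster
        b = min(sums[l] / counts[l] for l in sums if l != own)  # nearest other cluster
        total += (b - a) / max(a, b)
    return total / n


def choose_k(points, partitions):
    """Pick the number of clusters whose partition has the highest average silhouette.

    `partitions` maps a candidate k to the list of cluster labels obtained for it.
    """
    return max(partitions, key=lambda k: mean_silhouette(points, partitions[k]))


# Toy example: two obvious groups along one axis.
toy_points = [[0.0], [1.0], [10.0], [11.0]]
toy_partitions = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
best_k = choose_k(toy_points, toy_partitions)
```

Values close to 1 indicate compact, well-separated clusters, whereas negative values indicate observations that sit closer to a neighbouring cluster than to their own.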

In addition to the hierarchical clustering approach based on Ward’s method, we also applied the NJ algorithm (Saitou & Nei, 1987; Studier & Keppler, 1988), another distance-based clustering method. In each iterative step, this algorithm joins the pair of taxa that leads to the largest decrease in overall tree length. In contrast to the approaches discussed above, NJ approximates the minimum evolution method (Felsenstein, 2004) and does not produce ultrametric trees, meaning that different rates of (trait) evolution are allowed for each taxon. The branch lengths of the tree can then be interpreted as the taxa’s rates of evolution. Conversely, under the assumption of a constant rate of evolution in each taxon, the branch lengths correspond to the temporal distance from their evolutionary origin. However, due to the use of a pairwise distance matrix, this approach remains purely phenetic, as information about character state changes is lost. We calculated the NJ trees using the phangorn package (Schliep, 2010), bootstrapped with 10,000 iterations, on the same Euclidean distance matrix as used for the hierarchical clustering, albeit restricting our analysis to the region of North-West France, which represents the largest and most densely sampled geographic area within this dataset.
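A compact sketch of the neighbour joining procedure may clarify how unequal branch lengths arise. The Python code below is a textbook-style illustration and not the phangorn implementation; it omits bootstrapping, breaks ties by taking the first pair encountered, and returns the unrooted tree as a list of edges. The function name, the internal node names and the small additive distance matrix are assumptions made for this example.

```python
def neighbour_joining(names, matrix):
    """Build an unrooted NJ tree; returns edges as (node, node, branch_length)."""
    d = {a: {b: matrix[i][j] for j, b in enumerate(names) if j != i}
         for i, a in enumerate(names)}
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all others.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * d[a][b] - r[a] - r[b]   # joining criterion
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        new = f"node{next_id}"
        next_id += 1
        # Branch lengths may differ between the two joined taxa.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((a, new, len_a))
        edges.append((b, new, len_b))
        d[new] = {}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    a, b = nodes
    edges.append((a, b, d[a][b]))                     # connect the last two nodes
    return edges


# Small additive distance matrix for five hypothetical taxa.
taxa = ["a", "b", "c", "d", "e"]
distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
tree_edges = neighbour_joining(taxa, distances)
leaf_lengths = {u: length for u, v, length in tree_edges if u in taxa}
tree_length = sum(length for _, _, length in tree_edges)
```

In this toy example the terminal branches differ in length although the taxa are joined pairwise, which is the property that distinguishes NJ phenograms from the ultrametric trees produced by UPGMA or Ward's method.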

Final Palaeolithic/Earliest Mesolithic Tanged Points

Finally, we turn to a previously collated dataset of Final Palaeolithic and earliest Mesolithic large tanged points from northern and eastern Europe. These data were assembled by Riede et al. (2019) and Ivanovaitė et al. (2020; cf. Fig. 1, 1–4), and their research historical context has been discussed in detail in these publications. Here, we have re-sampled this dataset in such a way as to retain only lithic artefacts of a defined named archaeological type (NAT) represented by more than three outlines. This resulted in the recovery of 253 complete large tanged point outlines from ten modern-day countries (Belarus, Belgium, Denmark, Germany, Lithuania, Poland, Russia, Sweden, the Netherlands and Ukraine) attributed to seven NATs traditionally used as cultural taxonomic diagnostics (Table 2, see Online Resource Table 1 and 2 for counts of NATs per chronozone and NATs per country, respectively).

Table 2 Outline counts per NAT that were retained for this study’s analysis, their chronozone placement and approximate absolute chronology. NAT data derived from the Final Palaeolithic/earliest Mesolithic tanged point dataset published by Riede et al. (2019)

The outlines of the dataset were processed and subsequently clustered in the same way as described in the case studies before, using the Euclidean distance matrix derived from all PC scores and Ward’s method. The average silhouette value was used to determine the optimal number of clusters. As for the Late Neolithic/Early Bronze Age arrowheads, we furthermore inferred a phenogram using the NJ algorithm described above.

To assess the overall configuration of the design space occupied by Final Palaeolithic/earliest Mesolithic large tanged points, we then compared the retrieved point outlines with the shapes of the previously sampled North American Paleoindian, Northwestern European Late Neolithic/Early Bronze Age and Czech Bell Beaker lithic points. For the Late Neolithic/Early Bronze Age and Bell Beaker point data from Northwestern Europe and the Czech Republic, we drew a stratified subsample by region (using the splitstackshape package of Mahto, 2019) to yield, where possible, the same number of outlines as for the large tanged point dataset, and from thereon treated them as a single sample of Late Neolithic/Early Bronze Age points. We subjected these outline datasets to a combined PCA, again following the procedures described above and removing any identified outliers. Using all PC scores from this combined dataset, we calculated the shape disparity of each of the three samples. Disparity is defined here as the amount of total morphological variation, measured as the sum of variances within each of the datasets included in the combined PCA. For this, we used the dispRity package (Guillerme, 2018), bootstrapping the PCA data 1000 times. To test for statistical differences between the disparity values obtained for each of the three datasets, we conducted a pairwise Wilcoxon signed-rank test with Bonferroni correction to decrease the likelihood of Type I errors arising from multiple testing.
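The disparity measure itself is simple to state in code. The following Python sketch is not the dispRity implementation; it computes the sum of the per-axis sample variances of a set of PC scores and bootstraps this value by resampling observations with replacement. The function names, the use of the sample (n − 1) variance, the fixed seed and the toy scores are assumptions made for illustration.

```python
import random
import statistics


def sum_of_variances(scores):
    """Disparity as the sum of the sample variances of all PC axes."""
    return sum(statistics.variance(axis) for axis in zip(*scores))


def bootstrap_disparity(scores, n_boot=1000, seed=1):
    """Bootstrap distribution of the sum of variances (rows resampled with replacement)."""
    rng = random.Random(seed)
    n = len(scores)
    values = []
    for _ in range(n_boot):
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        values.append(sum_of_variances(resample))
    return values


# Toy PC scores for a morphologically variable and a uniform sample.
wide = [[0.0, 0.0], [4.0, 1.0], [8.0, 3.0], [2.0, 7.0], [6.0, 5.0]]
narrow = [[x * 0.1, y * 0.1] for x, y in wide]
wide_boot = bootstrap_disparity(wide)
narrow_boot = bootstrap_disparity(narrow)
```

The resulting bootstrap distributions, one per sample, are the quantities that would then be compared pairwise, with the significance threshold divided by the number of comparisons under a Bonferroni correction.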

Protocol Evaluation: Results

In this section, we present the results of our whole-outline 2D analysis of three selected lithic point samples (Paleoindian Goshen and Plainview points, Czech Bell Beaker arrowheads and Northwestern European Late Neolithic/Early Bronze Age arrowheads) and compare them with the conclusions of their original studies in order to assess the performance of our approach in reproducing the results of landmark-based approaches or traditional typological reasoning, respectively. While we recognise that different approaches to describing artefact form, shape and geometry necessarily come with different sets of advantages and disadvantages and hence complement each other in varying ways, the purpose here is to evaluate whether and to what extent outline-based geometric morphometric methods offer a promising, feasible, effective and efficient way to capture the variability of artefact shape of particular stone artefact classes. Acknowledging the discussion about evolutionary information contained in outline data briefly mentioned in the beginning (i.e. Zelditch et al., 1995, 2001; MacLeod, 1999, 2001), we further test to what extent this approach is suited to infer cultural evolutionary relationships using two-dimensional outline data.

Reproducing the Results of Buchanan et al. (2020)

In our whole-outline GMM re-analysis of the data of Buchanan et al. (2020), the first two PCs can be interpreted as describing the artefacts’ width (PC1 = 61.3%) and symmetry (PC2 = 24.8%), respectively (Fig. 2). Except for a few outliers, the two types show considerable overlap. While parametric tests such as the Mahalanobis distance between the group centroids (using the first six PCs, \({D}^{2}=1.06\)) or Hotelling’s generalised \({T}^{2}\)-statistic (\(p=0.021\), \({T}^{2}=0.104\)) would suggest statistically significant differences, they assume a multivariate normal distribution, which is not supported by the data. In contrast, a nonparametric PERMANOVA test with 10,000 permutations resulted in \(p=0.338\) (\(F=1.071\)). Thus, \({H}_{0}\) (the means of all groups are equal) could not be rejected, indicating that the groups cannot be distinguished statistically. Likewise, the distributions derived from the measure of relative shape distances (Klingenberg & Monteiro, 2005) subjected to a Mann–Whitney U test yielded a non-significant result (\(p=0.46\); \(W=1312\)), again suggesting that the groups cannot be separated. In the LDA, all Goshen-type points of the test dataset were classified as Plainview type, whereas all Plainview-type specimens of the test dataset were classified correctly. In sum, our analysis confirms the results of Buchanan et al. (2020) and simultaneously demonstrates that outline-based approaches are a robust alternative for capturing lithic artefact shape vis-à-vis landmark-based approaches. Having said this, it should be noted that the Goshen point sample is fairly limited, comprising only 17 specimens, which places the discriminative power of the statistical analyses on relatively weak grounds.

Fig. 1

Examples of lithic artefacts subjected to 2D-GMM analysis. 1–4 Large tanged points from the European Final Palaeolithic; 5–8 North American Paleoindian points; 9–14 Late Neolithic/Early Bronze Age arrowheads from central and Northwestern Europe. 1 Large tanged point with minimal distal edge modification from Eskebjerg, Denmark (Fugl Petersen, 2001, Fig. 12, 1); 2 large tanged point with direct lateral retouch from Eskebjerg, Denmark (Fugl Petersen, 2001, Fig. 12, 5); 3 triangular tanged point with one-sided concave basal modification from Golßen, Germany (Winkler, 2010, Abb. 16: 18); 4 tanged point with minimal distal edge correction from Golßen, Germany (Winkler, 2010, Abb. 16: 15, after Geupel, 1987); 5 bifacial Plainview point with concave base from Ted Williamson, US (courtesy of B. Buchanan); 6 bifacial Goshen point with concave base from Hell Gap, US (courtesy of B. Buchanan); 7 bifacial Plainview point with straight base from Milnesand, US (courtesy of B. Buchanan); 8 bifacial Plainview point with straight base from Milnesand, US (courtesy of B. Buchanan); 9 bifacial triangular, winged-stemmed arrowhead from Tumulus de Cruguel, Morbihan, France (Nicolas, 2016, Planche 73: FR-58-08); 10 bifacial arrowhead with concave base from Blære, Års, Denmark (Nicolas, 2016, Planche 170: DK-05-07); 11 bifacial arrowhead with concave base from Jezeřany-Maršovice, Czech Republic (Olivík, 2009, Tab. 38: 6, after Kopacz et al., 2009); 12 bifacial arrowhead with concave base from Jiřikovice, Czech Republic (Olivík, 2009, Tab. 38: 7, after Kopacz et al., 2009); 13 arrowhead with heavy, circumferential edge retouch from Ledce II, Czech Republic (Olivík, 2009, Tab. 38: 8, after Kopacz et al., 2009); 14 arrowhead with heavy, circumferential edge retouch from Ledce II, Czech Republic (Olivík, 2009, Tab. 38: 9, after Kopacz et al., 2009). No scale

Fig. 2

Scatterplot of the first two PC scores of Goshen and Plainview point outline data with their respective 95% confidence interval ellipses created with ggplot2 (Wickham, 2016)

Reproducing the Results of Petřík et al. (2018)

As shown in the original study by Petřík et al. (2018), the first two PCs of our replication study cover variability of point width and base shape, respectively. Yet, while PC1 in the original study only explained ~34% of the total shape variation, it explains 76% in our outline-based analysis. Petřík et al.’s PC3 reflects object symmetry and explains 10% of variation. In our study, symmetry is instead represented by PC5 and explains merely 3% of the total variation. Both approaches thus identify the same general dimensions of shape variation, but the whole-outline method seems to accord greater importance to lithic object width and the relative convexity of the artefacts’ lateral edges.

Inspection of the PCA plot of the outline data suggests that PC1 explains such a large amount of variation because of four outliers belonging to Clusters 6 and 8 (Fig. 3). We suggest that those outliers are responsible for Petřík et al.’s (2018) scree plot indicating a much lower optimal number of clusters (see Petřík et al., 2018, supplementary Fig. 3). Our cluster analysis using the UPGMA algorithm thus indicates that we are largely dealing with a single point type, characterised by a concave base and the absence of a tang, with variation in overall width and base concavity (see especially Online Resource Fig. 7). Wider points with a flatter base are grouped in Cluster 1, whereas the narrower points with more pronounced concave bases fall into Cluster 4. Clusters 2 and 3 represent stemmed points, whereas Clusters 6, 8 and 9 seem to gather outliers, as do Clusters 7 and 10, which represent asymmetric points with a single wing on their right side. We followed Petřík et al. (2018) in dividing the dataset into ten clusters. In general, the mean shapes of the retrieved ten groups are comparable to those derived in the original study. While we cannot further quantify the degree of match between the results of the landmark-based and whole-outline GMM methods, they produce similar overarching trends: two main clusters surrounded by many outliers (see Petřík et al., 2018, Table 1). It is worth noting here, however, that the outline-based approach appears to be more sensitive to outliers, potentially leading to heavy distortions in the PCA.
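The average-linkage (UPGMA) clustering step referred to above can be illustrated with a minimal Python sketch. This is not the code used in the study: the function name `upgma_clusters`, the use of Euclidean distances between PC scores and the toy coordinates are all assumptions made purely for illustration.

```python
from itertools import combinations
from math import dist


def upgma_clusters(points, k):
    """Merge the two closest clusters (average linkage) until k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per specimen

    def average_distance(a, b):
        # UPGMA: unweighted mean of all pairwise distances between two clusters
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # find the pair of clusters with the smallest average distance
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical scores on PC1 and PC2 for six outlines (invented values)
pc_scores = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
groups = upgma_clusters(pc_scores, k=3)
print(groups)  # [[0, 1, 2], [3, 4], [5]]
```

In this toy example the isolated sixth specimen ends up in a cluster of its own, which mirrors the way a few extreme outlines can occupy separate clusters when the tree is cut at a fixed number of groups.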

Fig. 3

The ten clusters generated from the whole-outline GMM re-analysis of Bell Beaker arrowheads from Petřík et al. (2018). A The generated mean shapes and the associated UPGMA dendrogram. B Scatterplot of the first two PC axes. C Individual clusters and their respective artefact shapes

Reproducing the Results of Nicolas (2016)

Partitioning the point data of Nicolas (2016) at the number of clusters indicated by the average silhouette value (k = 9; Online Resource Fig. 12) yields the dendrogram shown in Fig. 4, which is based on Ward’s method. The shapes contained within the retrieved clusters closely mirror Nicolas’ (2016, Fig. 29) original typological divisions (Fig. 4; Online Resource Fig. 11). The types proposed by Nicolas (2016, Fig. 29) align along two axes, the first representing the shape of the arrowheads’ base (ranging from hollow to concave to barbed), and the second conveying the presence and form of the barbs (ranging from pointed to so-called slanted ones). Regarding the shape variation captured by the analysed outlines, PC1 (54%) represents a combination of axis one and axis two of the author’s original type-based classification: it mainly describes the arrowheads’ base and ranges from a highly concave base with rounded, downwards pointing barbs and straight lateral edges to arrowheads with a very long stem and long, slanted barbs with generally more curved lateral edges. PC2 (24%) yields information about the barb-length-to-tang-length ratio and the global width of the artefacts, whereas PC3 (9%) describes the overall barb-length-to-tang-length ratio, with artefact width only playing a minor role. PC4 (4%) captures the configuration of the base with an emphasis on the overall length of tang and barbs. The symmetry along the main tool axis, especially concerning the unevenness in barb length and the extent of the hafting indentation, is represented in PC5 (2.5%).

Fig. 4

A Pruned dendrogram and derived mean shapes of the respective clusters based on Ward’s method using the outlines of the stone arrowheads published by Nicolas (2016). B PCA scatterplot of the first two PCs, coloured according to cluster attribution. C Overview of outlines contained in each cluster

Overall, the artefacts are best described as belonging to only two clusters (those with tanged versus those with concave bases), with a further subdivision of the tanged points into plump and less symmetrical arrowheads with short tang and barbs (Clusters 3 and 4) and more slender artefacts characterised additionally by their elaborated proximal part and high overall symmetry (Clusters 5, 6, 7, 8, 9). Importantly, this pattern aligns with the geographic distribution of the artefacts within the clusters (Fig. 5). While Clusters 1 and 2 almost exclusively appear in Denmark, Clusters 5, 6, 7 and 9 occur mainly in France. Clusters 3, 4 and 8, in contrast, are found both in France and Britain, although the few British arrowheads in Cluster 8 form a distinct subsample within this cluster.

Fig. 5

Geographic distribution of the derived Late Neolithic/Early Bronze Age arrowhead clusters based on Ward’s method, excluding outliers. Colour coding is identical to Fig. 4

In addition to these well-defined geographic patterns, the phenogram inferred by the NJ algorithm (Fig. 6) structurally replicates the typo-chronology established in the original study. Assuming a constant rate of trait evolution (see above), branch length along the x-axis reflects the temporal distance of a taxon from its ancestors. The tip labels are coloured according to archaeological sites and named according to their catalogue entries; the coloured column on the right represents the time period to which the corresponding arrowheads have been dated. Following this, the clade highlighted in yellow captures the transition from Bronze A1 Stage 1 via Bronze A2 Stage 2 to Bronze A2 Stage 3 (cf. Nicolas, 2017, Fig. 10). The clade coloured in light green captures the transition from Bronze A1 Stage 1 (close to the root of the tree) to Bronze A2 Stage 2 (with the longest branches). The clade between the clades highlighted in green and yellow exclusively contains artefacts from Bronze A1 Stage 1. The lowermost clade, containing mostly Bell Beaker arrowheads, yields a more ambiguous pattern.
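For readers unfamiliar with neighbour joining, the following Python sketch shows how the NJ algorithm turns a matrix of pairwise distances into a tree with branch lengths. It is a bare-bones illustration under stated assumptions, not the implementation used for Fig. 6: the function name `neighbour_joining`, the taxon labels and the toy distance matrix are hypothetical, and in practice the distances would be derived from the outline data.

```python
from itertools import combinations


def neighbour_joining(names, matrix):
    """Return the NJ tree as a list of (child, parent, branch_length) edges."""
    d = {a: {b: matrix[i][j] for j, b in enumerate(names)} for i, a in enumerate(names)}
    active = list(names)
    edges = []
    while len(active) > 2:
        n = len(active)
        total = {a: sum(d[a][b] for b in active) for a in active}
        # pick the pair that minimises the NJ criterion Q(a, b)
        f, g = min(
            combinations(active, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - total[p[0]] - total[p[1]],
        )
        node = f"({f},{g})"  # label of the new internal node
        len_f = d[f][g] / 2 + (total[f] - total[g]) / (2 * (n - 2))
        len_g = d[f][g] - len_f
        edges += [(f, node, len_f), (g, node, len_g)]
        # distances from the new internal node to every remaining node
        d[node] = {node: 0.0}
        for k in active:
            if k not in (f, g):
                d[node][k] = d[k][node] = (d[f][k] + d[g][k] - d[f][g]) / 2
        active = [k for k in active if k not in (f, g)] + [node]
    a, b = active
    edges.append((a, b, d[a][b]))  # join the last two nodes
    return edges


# Invented pairwise shape distances between five hypothetical artefacts
taxa = ["a", "b", "c", "d", "e"]
distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
tree = neighbour_joining(taxa, distances)
branch = {child: length for child, parent, length in tree}
for child, parent, length in tree:
    print(child, "->", parent, length)
```

Because NJ does not force all tips to be equidistant from the root, tips on longer branches are more derived in shape; reading those branch lengths as time is only justified under the constant-rate assumption stated above.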

Fig. 6

Phenogram based on the NJ algorithm using only the outlines of the artefacts from North-West France published by Nicolas (2016). The tip labels are coloured according to their respective sites and named after their catalogue entry; the coloured column on the right represents the time period into which they have been dated. Highlighted in yellow and green boxes are the clades that mirror the original study’s typo-chronology particularly well

Protocol Application: Final Palaeolithic/Earliest Mesolithic Tanged Points and the Evolution of Lithic Design Spaces

Having demonstrated our outline-based protocol’s ability to capture the main features of artefact shape variability in ways comparable to previous typo-technological or landmark-based approaches, we now apply the protocol to the sample of large tanged points from the Final Palaeolithic and earliest Mesolithic of northern and eastern Europe assembled by Riede et al. (2019) and Ivanovaitė et al. (2020). The purpose of revisiting these data is to apply the 2D outline-based approach developed here to explore the design space and potential cultural evolutionary signatures of this artefactual Leitfossil in relation to the proposed cultural taxonomic schemes and to compare the resulting patterns with the other three datasets analysed here.

The shape space of tanged points in the dataset is mainly described by the first two PCs. PC1 describes 50% of the shape variance and reflects symmetry along the main tool axis and the position of maximum width (Fig. 7). PC2 captures 27% of the shape variance and charts the relationship between the tang area and the distal part of the points as well as their slenderness. PC3 describes 6% of the recorded variance and tracks different distal edge configurations (straightness, convexity and convergence). PC4 explains only 4% of the shape variance and reflects the relative length of the tang and the transversal curvature of the distal part. The overall shape space of these tanged points is relatively coherent with a dense centre and no extreme outliers (see Fig. 7A, C, E).

Fig. 7

Shape space of the Final Palaeolithic/earliest Mesolithic tanged point dataset. A, C, E The scatterplots of PC axis 2, 3 and 4, respectively, combined with PC 1, colour-coded in accordance with the clusters in Fig. 8. The associated shape variation with each PC is depicted in D. B The proportion of variance explained by the respective PCs

Fig. 8

A Pruned dendrogram (Ward’s method) based on the Final Palaeolithic/earliest Mesolithic sample of tanged points. The mean shapes of the clusters are placed at the tree’s tips; the individual artefacts contained in each cluster are depicted in C. B The average silhouette value for the dataset if partitioned into a different number of clusters

The cluster analysis of tanged point outlines using Ward’s method yields little evidence in favour of unambiguous and well-defined groupings. The comparatively short branch lengths of the dendrogram indicate strong overall similarity in the outline dataset and limited evidence for higher-order groupings (Fig. 8). Based on the derived silhouette plot (Fig. 8B), the local optimum of clusters is k = 10, and the artefact composition of each cluster is provided in Fig. 8C. The resulting lower-level groupings of tanged lithic points (Clusters 1–10) further exhibit pronounced internal shape variability. Intra-group shape variance is lower, however, than overall inter-group shape variance, suggesting that these groups reflect general shape tendencies within a single population or design space. This observation is further supported by the unstructured geographical distribution of the artefacts (Fig. 9). The only exception here is perhaps the set of distinctly shaped artefacts of Cluster 9 from Belarus and Russia.
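The average silhouette value used to choose the number of clusters can be sketched in a few lines of Python. The sketch below is illustrative only and makes several assumptions that are not taken from the study: the function name `average_silhouette`, Euclidean distances between points, the convention that a singleton cluster receives a width of 0, and the invented toy coordinates and labels. It also assumes at least two clusters and no duplicate points.

```python
from math import dist


def average_silhouette(points, labels):
    """Mean silhouette width over all points for a given cluster labelling."""
    widths = []
    for i, p in enumerate(points):
        # mean distance from p to the members of each cluster (excluding p itself)
        mean_dist = {}
        for label in set(labels):
            members = [q for j, q in enumerate(points) if labels[j] == label and j != i]
            if members:
                mean_dist[label] = sum(dist(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:  # singleton cluster: width set to 0 by convention
            widths.append(0.0)
            continue
        a = mean_dist[own]  # cohesion: mean distance within the own cluster
        b = min(d for label, d in mean_dist.items() if label != own)  # nearest other cluster
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


# Invented coordinates forming two compact, well-separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
two_clusters = [0, 0, 0, 1, 1, 1]
three_clusters = [0, 0, 2, 1, 1, 1]  # needlessly splits the first group

score_two = average_silhouette(points, two_clusters)
score_three = average_silhouette(points, three_clusters)
print(round(score_two, 3), round(score_three, 3))
```

Comparing the mean width across candidate partitions, as in a silhouette plot, favours the labelling with the higher value. Uniformly low values across all candidate k would be consistent with the weak grouping structure described above.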

Fig. 9

Geographic distribution of the derived Final Palaeolithic/earliest Mesolithic tanged points dataset clusters based on Ward’s method, excluding outliers. Colour coding is identical to Fig. 8

Paralleling the application of the NJ algorithm to the French Late Neolithic/Early Bronze Age points (cf. Fig. 6), we seek cultural evolutionary signals in the Final Palaeolithic/earliest Mesolithic large tanged point sample. Based on the methodology developed here and the results obtained on our training samples, we expected that if a coherent cultural evolutionary signal is inherent in the shapes of the large tanged points, then these should cluster chronologically, regionally or both. Yet, unlike the phenogram of the French Late Neolithic/Early Bronze Age arrowheads, the inferred large tanged point phenogram yields no obvious association between branch lengths and chronological position (relative chronozone), geographic origin or prior cultural taxonomic assignment (Figs. 9 and 10).

Fig. 10

Inferred NJ phenogram of tanged points from the Final Palaeolithic and earliest Mesolithic of northern and eastern Europe. The tip label colour indicates the time period into which the artefacts have been dated, while the column on the right represents the NAT

In a final step, and in an attempt to assess the significance and configuration of the global shape space constraints of our Final Palaeolithic/earliest Mesolithic tanged points, we compared their outlines with those of the North American Late Paleoindian and the combined Northwestern European Late Neolithic/Early Bronze Age and Czech Bell Beaker point and arrowhead samples examined above in a single, combined PCA. The PCA produces two distinct and well-separated groups (Fig. 11). Almost 90% of the total shape variance is explained by the first PC, however, most likely reflecting different overall design solutions. The extremes of PC1 are occupied by forms which can be described as oval at one end and forms trending towards more triangular shapes at the other. Another key dimension of the shape space is the configuration of the base and the position of the maximal width, which is represented in PC2 (6%). Describing merely 2% of the recorded shape variance, PC3 reflects the degree of symmetry, although recall the substantial difference here between the highly symmetric Paleoindian and Late Neolithic/Early Bronze Age points and the strongly asymmetric Final Palaeolithic ones. PC4 (2%) captures slenderness and edge curvature. PC1 discriminates between the large tanged point sample at one extreme and the Paleoindian and Late Neolithic/Early Bronze Age points at the other extreme. The Paleoindian points exhibit the greatest internal homogeneity across all PC axes, suggestive of either strong functional constraints, strongly normative cultural transmission processes, or possibly a combination of these. The variance of Late Neolithic/Early Bronze Age arrowheads is greater and primarily explained by PC2. It is important to recall, however, that the variation within this sample was highly structured with nested clusters (Fig. 4).
The Final Palaeolithic/earliest Mesolithic large tanged points, in contrast, show a PC variance comparable to the combined Late Neolithic/Early Bronze Age sample in the absence of geographical structure. These diverging trends are also reflected in the overall sum of variance (disparity) within each of the three samples (Late Paleoindian, Late Neolithic/Early Bronze Age and Final Palaeolithic/earliest Mesolithic large tanged points), yielding statistically significant results regarding their pairwise difference (pairwise Wilcoxon signed-rank test p < 0.001, Fig. 11F). The high disparity in the Late Neolithic/Early Bronze Age dataset is explained by its strong internal and also regional structure. When the dataset is split into regions, each region’s internal sum of variance becomes lower and approaches the variance value calculated for the tanged point dataset (Online Resource Fig. 13F).
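As a rough illustration of the disparity measure, the sum of variance of a sample can be computed by adding up the variances of its scores along each PC axis. The Python sketch below assumes exactly that definition; the function name `sum_of_variances` and the two toy samples are invented for illustration, and the significance testing reported above is not reproduced here.

```python
from statistics import variance


def sum_of_variances(scores):
    """Disparity of one sample: sum of the per-axis sample variances of its PC scores."""
    axes = zip(*scores)  # transpose rows (specimens) into columns (PC axes)
    return sum(variance(axis) for axis in axes)


# Invented PC1/PC2 scores for a tightly constrained and a more dispersed sample
homogeneous = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
dispersed = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]

disparity_homogeneous = sum_of_variances(homogeneous)
disparity_dispersed = sum_of_variances(dispersed)
print(disparity_homogeneous, disparity_dispersed)
```

Because the measure simply adds variances across axes, pooling regionally distinct subsamples inflates it, which is why splitting a structured dataset by region can lower the value for each part.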

Fig. 11

Combined shape space of all of the datasets studied. A, C, E The scatterplots of PC axis 2, 3 and 4, respectively, combined with PC 1. D The shape variation associated with each PC. B The proportion of variance explained by the respective PCs. F Boxplots visualising the sum of variance (disparity) for each of the tool groups studied. In each panel, the colours represent the three populations under study, as labelled in F

It is also worth noting that the retrieved groupings in Fig. 11 are distinguished by the mode of tool production and retouch: the large tanged points are unifacial and often represent only locally retouched and edge-modified pieces, whereas both Paleoindian and Late Neolithic/Early Bronze Age points are bifacially modified. The position and spread of the Late Neolithic/Early Bronze Age arrowheads within the larger design space clearly shows that these forms are underpinned by the regulative idea of triangular and cordiform shapes and a marked trend from concave-winged to stemmed-winged proximal designs. The design of the proximal part also appears to be the distinguishing feature of the examined Paleoindian points, yet with an emphasis on relatively straight and horizontally undifferentiated basal configurations. The shape of the tanged points, by contrast, is defined by a more or less developed and elongated tang.

Discussion and conclusion

This study aimed to evaluate whether and to what extent outline-based geometric morphometric methods (GMM) are suited to capturing the variability of artefact shape in stone artefacts from different chronological and cultural contexts in an efficient and robust way, and whether and to what extent outline data are suited to reproducing previously established typo-chronological patterns. We first benchmarked our 2D outline approach to lithic artefact shape analysis against three previously published datasets, two of which used landmark-based GMM. When compared to the landmark-based results of the Paleoindian (Buchanan et al., 2020) and Bell Beaker projectile point (Petřík et al., 2018) studies, our results demonstrate that an outline-based approach can reliably reproduce the primary dimensions of shape variability and the general structure of projectile point design space. We therefore conclude that the main advantages of outline-based approaches over landmark-based analysis, especially when integrated into semi-automated digital extraction protocols as developed here (Matzig, 2021), are (a) rapid data capture from published drawings or photographs; and (b) the circumvention of the issue of defining homologous landmarks.

The case of Late Neolithic/Early Bronze Age arrowheads from Nicolas (2016) demonstrates that appropriate algorithm choice enables the retrieval of meaningful hierarchical clusters from artefact outlines which mirror previously and independently established archaeological types and so help to inform the interpretation of the geographic distribution of these artefacts. Moreover, and on the basis of a dated subsample of the same dataset, we were able to infer a phenogram that closely resembles the original typo-chronology elaborated by Nicolas (2016).

Based on these results, we were able to confidently turn towards the Final Palaeolithic/earliest Mesolithic large tanged points dataset, which has been used in several previous studies (cf. Riede et al., 2019; Ivanovaitė et al., 2020) questioning the validity of the so-called NATs characteristic of this time period. When the clustering approach tested in the Late Neolithic/Early Bronze Age case study is applied, the retrieved clusters of tanged points neither converge with the NATs to which they are traditionally ascribed, nor does the geographical distribution of the clusters point towards any patterns of regionalisation. Keeping in mind the skewed distribution of artefact numbers per NAT in this dataset, as well as the compelling results obtained from the other datasets analysed above, we conclude that the lithic artefacts of the different NATs form a comparatively homogeneous group and that the documented variability likely reflects residual variability related to blank production and original blank morphologies. At least a part of the documented outline variability of the large tanged points, in other words, is probably unrelated to the tool production itself.

The disparity analysis puts the variability of the large tanged points dataset into further perspective. Despite covering the largest period of time and broadest total geographical distribution, the tanged point dataset from the Final Palaeolithic/earliest Mesolithic exhibits an internal sum of variance comparable to each of the separate Late Neolithic/Early Bronze Age arrowhead subsamples from the UK, France and Denmark, which document pronounced spatiotemporal differences despite geographic adjacency. Furthermore, the inferred phenogram of the large tanged point dataset does not indicate any chronological pattern (judged by either their relative chronozone attribution or their position in their imputed archaeological type names which are, respectively, rather broad and fuzzy and have previously been contested; cf. Reynolds and Riede, 2019). Given confounding factors such as the low chronological resolution of the sample, imbalanced artefact selection, the low number of analysed specimens and the limited number of NATs, these findings should be seen as a starting point for further analysis.

The results of our analysis demonstrate that quantitative outline approaches to artefact shape spaces can, in principle, help to inform our understanding of cultural evolutionary dynamics of lithic forms, as is, for example, demonstrated by our replication of Nicolas’ (2016) arrowhead-type sequence. Yet, we also caution that while overall shape may reflect the broader design space configurations of prehistoric projectile points including large tanged points and arrowheads, technological processes and attributes may yet be as informative or more so in terms of cultural transmission dynamics (e.g. Tostevin, 2019). Shape is only one dimension of point design, and understanding the morphological variability of pointed tools ultimately requires paying close attention to the interaction of shape-related variables with other key aspects of tool-making and tool design (cf. esp. Gowlett, 2006; Richter, 2013), such as object biography and allometry, blank production, manufacturing techniques, and the strategies and goals of tool modification and curation. In tool designs largely centred on unifacial blanks, the relationship between blank production and tool manufacture may partially pre-determine point variability (cf. Migal, 2007). There is a clear difference in the amount of immediate shape control afforded by tools manufactured on débitage blanks and those made on façonnage products, especially when secondary modification targets only a part of the blank matrix and therefore only makes a comparatively small contribution to the whole outline of the final tool form. In contrast to bifacial shaping, which generally enables high levels of overall shape control, it must thus be expected that at least some residual shape variability of tools made on these blanks reflects the variability of forms supported by the underlying blank production technology.
In other words, a large portion of the tool outline is not directly controlled by the modification stage of the operational sequence but instead indexes the relationship between lithic production technologies and blank selection. This recognition has significant implications for how to interpret differences in the degree of shape normalisation detected by outline-based approaches.

In comparison to the large tanged points, the post-Palaeolithic arrowheads from central and Northwestern Europe analysed here attest to a tool design converging on broadly cordiform and triangular shapes. The degree of base differentiation (e.g. winged or winged and stemmed) appears to be the main design feature. The North American Late Paleoindian points converge on broadly the same design space yet trend towards straight, non-differentiated proximal parts, thus achieving an integration of ‘prehensive’, ‘transmissive’, and ‘transformative’ tool parts (sensu Lepot, 1992; Albrecht & Müller-Beck, 1988; Boëda, 2013; Richter, 2013). This regulative idea, in turn, may be responsible for the comparatively low shape disparity recorded in this point sample, indicating well-established conventions of global tool morphology and possibly also elevated levels of tool production and modification normativity. The Final Palaeolithic/earliest Mesolithic tanged points from northern and eastern Europe populate the other half of the total design space, as their oval form tendency with pronounced tangs is clearly different from the mainly triangular shape tendencies of the North American Paleoindian points and the European post-Palaeolithic arrowheads. This general segregation of the design space also becomes evident in the PCA scatterplot (cf. Fig. 11), suggesting that proximal structural differentiation is another discriminating factor together with the degree of symmetry and point lateralisation. Unsurprisingly, the design of the studied tanged points is clearly centred on a more or less pronounced and well-defined ‘receptive’ or ‘prehensive’ tool part (the tang), leaving the distal ‘transmissive’ and ‘transformative’ tool parts to vary strongly.

In thinking about the generative processes behind the differences in the design between Final Palaeolithic/earliest Mesolithic tanged points and later prehistoric arrowheads, it is important to consider not only different design imperatives but their larger archaeological contexts as well. The Czech Bell Beaker and Northwestern European Late Neolithic/Early Bronze Age arrowheads, for example, make their appearance within a socio-political context in which knapped stone technologies co-exist with pottery technologies, polished and ground stone technologies and early metal production (e.g. Sarauw, 2008). As a consequence, design imperatives and object norms are potentially at least in part also shaped by the interplay with these other technologies, providing space for functional diversification and social status negotiations within a socio-political economy of object use and significance, including varying degrees of symbolic investment. Given that primary subsistence-related and economic roles may then be overtaken by other key technologies which potentially even outperform lithic tools in these respective functions, lithic tools can acquire potent sociocultural, ideological, and cosmological value or be relegated to specialised domains of human behaviour such as warfare, social stratification or territoriality (e.g. Dietrich et al., 2019; Gopher, 1994; Whittaker, 1994). Such intra-society dynamics create the cultural and technological space for a marked diversification of artefact morphotypes. Lifting the morpho-functional but also the labour constraints of lithic points may incentivise the exploration of the available and adjacent design space even further and therefore accelerate the evolution of artefact forms.

This situation is contrasted by the Final Palaeolithic/earliest Mesolithic points. The morphotype of the large tanged point (≥ 15 mm and > 5 g), which in Europe continues to be treated by some scholars as a classic culture-historical Leitfossil of the Final Palaeolithic (ca. 15,000–11,000 cal BP), is found in the archaeological record of many regions and periods (e.g. Pesesse & Flas, 2012; Seong, 2008), suggesting a global design optimum on which lithic point design converges multiple times even in geographically unrelated areas (cf. O’Brien et al., 2018). One explanation for their lack of chronological and regional specificity may thus be sought in the role of these large tanged points as functional components of larger systems of hunting weaponry and their associated subsistence and mobility strategies. For example, the primary occurrence of such massive lithic points in higher latitudes may relate to their involvement in specialised yet relatively flexible weapon systems based on the basic design principle of ‘reliability’ (sensu Bleed, 1986). The primary functional constraint would have been their ability to penetrate thick-skinned animal species such as the European moose or Giant Elk (cf. Bokelmann, 1978; Tomka, 2013), while ballistic considerations generally suggest that large tanged points serving as dart tips should have had relatively lax constraints on size and weight (cf. Riede, 2009). The whole-shape variability recorded in our Final Palaeolithic/earliest Mesolithic tanged points dataset generally supports this interpretation, indicating that especially the distal part of these points, and thus their ‘business ends’ (Kuhn, 2020), allowed for considerable formal variability and that normalising this part of the tool was not a primary concern; design imperatives rather ensured their convergence or pointedness and not so much particular morphological or technical details connected to point tip and lateral edge configuration.
The consistency of the tanged element, often prepared in more elaborate fashion, is arguably the central design feature of these points and in fact suggests that another source of constraint was at least equally important: hafting. The coherent elaboration of a structurally differentiated proximal part would support what Kuhn (2020, 46) terms ‘Mode B’ hafting — the strategy of custom-fitting a stone object to a durable handle or intermediate part made of another type of raw material (cf. Barham, 2013). The fact that various reduction approaches and modification tactics were apparently employed to achieve essentially the same outcome in the European Final Palaeolithic — a tang-like prehensive tool part — substantiates this conclusion, pointing towards hafting requirements as a key dimension of the overall design of these large tanged points.

It is salutary to keep in mind that prehistoric stone artefacts are non-industrial products (Kuhn, 2020; Shea, 2017) and that all knapping has an inescapably experimental quality (Weißmüller, 1995, 2003), which leads to much natural variability or ‘noise’ in the data, especially if this variability is not deliberately removed by extensive secondary modification. Limited functional constraints, the influence of the mode of production and copying error in cultural transmission (cf. Eerkens & Lipo, 2007), as well as other processes such as horizontal transmission and exchange of weapons or weapon tips (Nishiaki, 2013; Wiessner, 1983), may have combined to obscure any cultural phylogenetic signal in this particular artefact class. Whether the signal is further compounded by a combination of low population density and the multifarious impacts of the Laacher See eruption (see Riede, 2017) is open to debate. An alternative interpretation of the absence of any cultural phylogenetic signal in this dataset (apart from the concerns about chronological resolution and sample size raised above) may thus be methodological: no cultural evolutionary information may be inherent in outline shapes because outline is only indirectly and often weakly controlled by potentially learned and transmitted tool-making behaviours.

Whatever the ultimate reasons, the internal sum of variance at least demonstrates that the assemblage of large tanged points studied here, which falls into a time frame of approximately 2000–3000 years, is characterised by considerably lower combined outline disparity than the post-Palaeolithic arrowheads with a much narrower spatiotemporal spread (Table 3). One way to interpret this difference is to regard the sample of Final Palaeolithic/earliest Mesolithic tanged points as an expression of slow-paced, ‘thick time’ (sensu Shryock et al., 2012) temporalities of change, resulting in much formal supra-regional homogeneity and markedly reduced rates of morphological evolution. Again, this might be linked to the firm integration of these large tanged points into modular weapon systems, while the present sample of post-Palaeolithic points may exhibit functional diversification and specialisation (cf. Nicolas, 2016) mediated by a sociocultural context in which tool status was highly dynamic and possibly responsive to social and economic stratification. This interpretation is also consistent with the data obtained from the Paleoindian sample, which yields the lowest total sum of variance of all studied datasets, even though its chronological and spatial coverage is substantial (Table 3). Paleoindian point designs are generally consistent with principles of ‘reliability’ as originally outlined by Bleed (1986), whereas European later prehistoric points seem to combine aspects of reliability with deliberate stylistic variation driven by concerns of status and identity, which Wiessner (1983) has influentially termed ‘emblematic style’.
The Final Palaeolithic large tanged points exhibit both overall shape variability (design concerns apparently centre on the creation of structurally differentiated prehensive parts) and little in the way of internal spatiotemporal patterning, suggesting generally expedient point designs and largely ‘embedded’ or ‘isochrestic’ style regimes (sensu Sackett, 1982). All of these cases reflect substantially different design solutions, and careful comparative whole-outline analysis is able to make them evident.

Table 3 Summary of the analysed datasets, sample characteristics and main results including calculated disparity values

In conclusion, our findings have important implications — both methodological and theoretical — for better understanding and explaining the relationship between cultural evolutionary processes and shape dynamics in lithic points and also for how to contribute to and advance the broader debate on style, function and design underwriting processes of stone tool manufacture and use in archaeology and beyond. 2D whole-outline analyses as developed here could, for instance, be used to test long-standing ideas about the changing status of shape-anchored lithic types through time (cf. Weißmüller, 2003; Valentin, 2011). Future studies may build on the methods applied here and, if possible, explore if and when rates of shape evolution experienced major changes and when and how design spaces were subjected to reconfiguration, ideally comparing multiple periods, transitional phases and regions. In palaeobiology, new computational methods for inferring phylogenies from continuous data are currently emerging (Parins-Fukuchi, 2017; 2018) and may yet offer a powerful extension of the methodology presented here. In particular, these approaches are adopting Bayesian inference methods (Wright, 2019; Warnock & Wright, 2020) and offer an integration with the downstream analytical tools of the so-called ‘comparative method’ (e.g. Adams & Collyer, 2019) which has seen only limited use in Palaeolithic archaeology (cf. Riede, 2011). Alternatively, a discrete character-based phylogenetic approach could be implemented and tested, systematically mapping a range of multivariate traits, such as tool outlines and contextual ecological variables.