Introduction

A number of recent papers have revived long-standing debates concerning the validity of the cultural taxonomies adopted by archaeologists (e.g. Riede et al. 2020; Reynolds and Riede 2019a; Sauer and Riede 2019; Ivanovaite et al. 2020). Although the foregoing papers focus on the Late Upper Palaeolithic of Europe, they form the latest instalment of a debate that is as old as archaeology itself and has at various points encompassed all periods and regions (e.g. Bishop and Clark 1967; Dunnell 1971; Clark and Lindly 1991; Bisson 2000; O'Brien and Lyman 2002; Shea 2014, 2020). Of particular relevance to the current paper is the fact that Shea (2019) notes a clear parallel between the problems identified by Reynolds and Riede (2019a) for the European Upper Palaeolithic and those encountered in the analysis of eastern African archaeological material (Will et al. 2019; Shea 2020).

Debates on the validity of cultural taxonomy have a long history in African archaeology (e.g. Goodwin and Van Riet Lowe 1929; Bishop and Clark 1967; Shea 2020), with the added complication that most early classificatory schemes involved a ‘bastardisation of European terminology’ (Goodwin and Van Riet Lowe 1929:97) that was poorly suited to the African evidence. Indeed, Goodwin (1958:33) reflected that prior to the establishment of a purpose-built African terminology ‘we had been trying to describe giraffe in terms of camel, or eland in terms of elk’. The inadequacy of European terminology prompted the establishment of a bipartite division of (southern) African material into Earlier and Later Stone Ages, ratified at the 24th annual meeting of the South African Association for the Advancement of Science in Pretoria, 1926 (Goodwin 1926). Continuing research by Goodwin and Van Riet Lowe (Goodwin 1928; Goodwin and Van Riet Lowe 1929) soon led them to recognize that the inclusion of a third period—the Middle Stone Age—was ‘essential to cover the facts observed’ (Goodwin 1946:74). This tripartite division subsequently became the norm throughout sub-Saharan Africa.

From the outset, it was recognized that there were varied regional and chronological facies within each of the major ‘Ages’, that certain industries could be regarded as transitional between them, and that the differences between them were quantitative rather than qualitative. Goodwin (1946:74) was careful to note that ‘the three periods overlap to some extent… we only reach each new “Age” as the new technique becomes dominant’; delegates at the Third Pan-African Congress on Prehistory in 1955 resolved that ‘more elasticity’ was required in the use of the three Ages (Cole 1955:204) and adopted two intermediate stages between them (Cole 1955; Clark 1957). As the three Ages came to be used over a greater extent of the continent, it became clear that they could not be used as chronological markers and that ‘time connotations must be separated from cultural concepts’ (Bishop and Clark 1967:866); transitions from one Age to another could be protracted and did not occur simultaneously—nor even necessarily follow the same trajectories—in different geographical areas (Scerri et al. 2021).

The Burg Wartenstein symposium of 1965 was highly critical of the typology developed by Goodwin and Van Riet Lowe, and even more critical of its rather lax subsequent use; indeed, Isaac’s proposal that ‘the terms “Earlier”, “Middle”, and “Later” Stone Age in Africa should be abolished for all formal usage’ was agreed unanimously (Bishop and Clark 1967:867). Kleindienst (1967) noted that publications of indicative assemblages with adequate descriptions were scarce and that as such prehistorians had tended to use ‘the same terms but with different definitions, different connotations, and at different levels of abstraction’ (Kleindienst 1967:828); her extensive lexicon makes such problems abundantly clear. Although the Burg Wartenstein delegates advocated the abandonment of the three Ages, the major issue for archaeologists over the subsequent decades was that no concrete suggestions had been made as to what would replace them (Sampson 1974; Parkington 1993; Underhill 2011). As such, the three Ages have retained their dominance over the African record; Clark’s (1969) mode system provides a useful accompaniment, and Shea’s (2020) EAST typology may yet have significant impact, but of the assemblages analysed below, all are designated by their excavators as either Middle or Later Stone Age.

Goodwin and Van Riet Lowe’s (1929) distinction between Middle and Later Stone Age implements rests upon differences in the preparation of the striking platform and in the nature of the resultant flakes. MSA flakes are marked by faceted striking platforms and convergent edges, unlike the flat striking platforms and parallel edges of the LSA. The essentially triangular MSA flake is therefore ‘eminently suitable for use as a point, and, indeed, the typical implement throughout the Middle Stone Age is the worked point in a variety of forms’ (Goodwin and Van Riet Lowe 1929:98). These basic distinctions persist; contemporary researchers stress decreases in prepared core technologies and retouched points together with increases in the production of backed pieces, prismatic blades and bladelets, and bipolar reduction as signalling the transition from the MSA to the LSA (Gossa et al. 2012; Pleurdeau et al. 2014; Masao 2015; Lahr and Foley 2016; Leplongeon et al. 2017; Shipton et al. 2018; Tryon 2019).

Although the primary distinction between MSA and LSA assemblages has been established on the basis of changes in lithic technology, increases in frequency of a number of other elements of material culture have also been aligned with this transition (e.g. Tryon 2019). Ground stone tools appear during transitional sequences at Mumba, Nasera, and Kisese II (Mehlman 1989; Tryon et al. 2018; Tryon 2019), while bone tools demonstrate erratic early appearances (e.g. Pante et al. 2020) before increasing in frequency during the LSA (e.g. Langley et al. 2016; Shipton et al. 2018). The use of ochre is associated with the earliest MSA at Olorgesailie (Brooks et al. 2018) but becomes widespread only in the late MSA and LSA (Tryon 2019; D'Errico et al. 2020). Finally, the appearance of disk beads made from ostrich eggshell may be a true marker of the transition, with the earliest examples found in eastern Africa around 50 ka at Mumba and Magubike (Gliganic et al. 2012; Miller & Willoughby 2014). The earliest examples of engraved ostrich eggshell in eastern Africa date to ~43 ka at Goda Buticha and are associated with an MSA industry (Assefa et al. 2018). The transition between MSA and LSA has therefore been identified across a range of material classes, but the ubiquity of stone tools, and their durability in the archaeological record, provides a robust means to examine change through time that is less impacted by patterns of selective preservation.

In eastern Africa, the MSA first appears ~300 ka at Olorgesailie (Brooks et al. 2018) and persists until the end of MIS 3, ~30 ka (e.g. Ossendorf et al. 2019); the eastern African LSA first appears ~67 ka at Panga ya Saidi and persists into the Holocene (Shipton et al. 2018). The chronological overlap between the two industrial complexes is therefore substantial, and it should be noted that individual ‘LSA’ technologies are by no means absent from MSA assemblages (Blinkhorn and Grove 2018), whilst some important MSA technologies persist within the LSA (e.g. Ranhorn and Tryon 2018; Shipton et al. 2018). Mirroring Goodwin’s (1946) cautions concerning the lack of clear-cut divisions, Ranhorn and Tryon (2018) suggest that proportional rather than categorical differences may be critical, while Grove and Blinkhorn (2020) find that the use of co-occurring constellations of technologies rather than individual fossiles directeurs allows for robust discrimination between industrial complexes. Using machine learning algorithms, Grove and Blinkhorn (2020) demonstrate that the co-occurrence of Levallois flakes, retouched points, core tools, and scrapers is indicative of the MSA, whilst an alternative constellation of blades, backed pieces, and bipolar reduction signals the LSA.

The three Ages thus remain dominant but disputed, and as Robertshaw (1990:8) wryly notes, discussions of typology and nomenclature in African archaeology have ‘often generated a great deal of heat but very little light’. The analyses reported below statistically test the validity of the division between assemblages labelled MSA and LSA using a combination of weighted binary logistic regression and permutation analysis. Whilst the primary aim is to assess the integrity of this particular division within the Stone Age of eastern Africa, the subsidiary aim is to provide a blueprint for the kind of analysis that might be used to test cultural taxonomic integrity in other periods and regions.

Methods

Data

The archaeological database used is that documented in Grove and Blinkhorn (2020), with the exception that the putative LSA assemblage from Nasera levels 4 and 5 is omitted. In the neural network study of Grove and Blinkhorn (2020) that sought to distinguish between LSA and MSA assemblages in eastern Africa, Nasera 4/5 was the only assemblage misclassified. Recent radiocarbon dates on ostrich eggshell beads obtained from stratigraphic positions above and below this assemblage by Ranhorn and Tryon (2018) indicate that it is somewhat older than originally suggested by Mehlman (1989), and whilst chronological age is certainly not a valid proxy for industrial affiliation, both Ranhorn and Tryon (2018) and Grove and Blinkhorn (2020) argue that this assemblage’s LSA status is questionable. Further to this, when employing the typology used by Grove and Blinkhorn (2020), this assemblage is identical to Mumba UV 38, which is unequivocally MSA. The database employed below thus consists of 91 assemblages (LSA n = 30; MSA n = 61) evaluated on the basis of the presence or absence of 16 technologies (see Grove and Blinkhorn 2020 and Supplementary Materials for further details).

The 16 technologies used in the database were Backed Pieces, Bipolar Technology, Blade Technology, Borer, Burin, Centripetal Technology, Core Tool, Denticulate, Levallois Blade Technology, Levallois Flake Technology, Levallois Point Technology, Notch, Platform Core, Point Technology, RT Bifacial, and Scraper. These technologies were chosen following a comprehensive search of the literature and were amalgamated from various synonymous terms used in the literature by previous authors. The terms employed encompass the full breadth of terminology used to describe stone tool assemblages for Late Pleistocene eastern Africa. Although previous researchers have in some cases employed different designations to refer to indistinguishable artefact forms (e.g. radial core as opposed to discoidal core), the amalgamation of such terms into a reduced taxonomy of 16 technologies goes a long way towards obviating this problem. The database utilizes existing classifications employed by the researchers who excavated or analysed a given assemblage; this is the case both for the designation of technocomplexes (i.e. ‘MSA’ or ‘LSA’) and for distinctions between multiple assemblages from the same site (e.g. Panga ya Saidi 5 or Panga ya Saidi 6). Although there may be differences in excavation and analytical techniques that lead to different concepts of what constitutes a distinct assemblage, the designations provided in the published literature provide the logical starting point for any subsequent analysis. Further details, including a comprehensive breakdown of synonymous terms, can be found in the Supplementary Materials. The locations of the assemblages used in the analyses are shown in Fig. 1.

Fig. 1

The distribution of archaeological sites from which analysed assemblages derive, plotted on an SRTM (1 arc-second) DEM obtained from USGS earth explorer (https://earthexplorer.usgs.gov)

Whilst the analyses presented below focus on differences in the technologies comprising LSA and MSA assemblages, it should be noted that differences in artefact size and in raw material use have also been suggested to distinguish between these two industrial complexes. Decreases in artefact size from the MSA to the LSA have been previously noted (e.g. Leakey et al. 1972; Eren et al. 2013; Tryon and Faith 2016; Shipton et al. 2018), with Pargeter and Shea (2019) stressing the significance of miniaturization as a trend through time. Decreases in artefact size have also been identified in sequences at individual sites (e.g. Shipton et al. 2018). In terms of raw material use, an increasing focus on more fine-grained materials and those that appear in smaller clast sizes has been documented (e.g. Leakey et al. 1972; Shipton et al. 2018). Nonetheless, the analyses of Grove and Blinkhorn (2020) demonstrate that technological shifts alone afford considerable discriminatory power; as the analyses presented below rely only on technological differentiation, they can be regarded as conservative in terms of their assessment of the integrity of the LSA/MSA division.

The basic hypothesis to be tested is that the division of these 91 assemblages into two classes—labelled LSA and MSA—is statistically valid in the sense that a model that distinguishes between these two classes can be obtained with a deviance lower than that obtained via alternative divisions of the data. This hypothesis can in fact be formulated and tested in both weak and strong forms. The weak form employs a standard p-value, such that validity is claimed when, for example, the probability of obtaining a deviance as low as that obtained for the LSA/MSA division in a permutation sample is less than one in twenty (equating to α = 0.05). The null hypothesis in this case is that the LSA/MSA division is invalid because it leads to a model deviance that could have occurred at random with a relatively high probability. The strong form of the hypothesis is that the LSA/MSA division is valid in that it leads to a model deviance that is lower than that obtained via any other possible division of the data. In this case, the null hypothesis is simply that the LSA/MSA division is invalid because it is not the single best division of the data.

General statistical practice regards the strong hypothesis as overly conservative, and there are computational obstacles to testing this hypothesis precisely as stated. An exact permutation test would be required to assess the deviance of every other possible division of the data, and this is computationally intractable (including the split into LSA and MSA, there are \( {2}^{91}\approx 2.48\times {10}^{27} \) possible divisions of the data). This is a common problem for permutation analyses, however, and a truly random sample of several thousand permutations is normally regarded as sufficient (e.g. Ernst 2004). Since both weak and strong hypotheses can be assessed via the same permutation test (using alpha values of 0.05 and 1/(1 + the number of permutations), respectively), the results of both are discussed below.

Analyses

Weighted Binary log-F Regression

The analyses are built on the foundation of weighted binary logistic regression (henceforth, WBLR), a common statistical method for studying differences between classes. Weightings are applied both to account for differences in sample size between groups and—in the permutation study below—to accommodate the fact that there are several sub-groups of assemblages that appear identical under the typological scheme employed here. The basic weighting scheme ensures that the sums of weights for the two groups are equal; the weight for each assemblage in a given group, \( {w}_g \), is given as \( {w}_g=\frac{N}{2}{n}_g^{-1} \), where \( {n}_g \) is the number of assemblages in that group and N is the total number of assemblages in the analysis. This ensures that \( \sum {w}_1=\sum {w}_2 \) and that \( \sum {w}_1+\sum {w}_2=N \). For the example of LSA and MSA classification, LSA assemblages are assigned the weight \( {w}_{LSA}=\frac{91}{2}{30}^{-1}\approx 1.517 \) and MSA assemblages the weight \( {w}_{MSA}=\frac{91}{2}{61}^{-1}\approx 0.746 \).
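As a concrete illustration of this weighting scheme, the following minimal Python sketch (our own, not the Matlab code used for the published analyses; the function name and structure are illustrative only) computes the per-assemblage weights and checks the group sums:

```python
import numpy as np

def group_weights(labels):
    """Per-assemblage weights such that each group's weights sum to N/2.

    labels: 1-D array of 0/1 group membership (here 0 = LSA, 1 = MSA).
    """
    labels = np.asarray(labels)
    N = labels.size
    weights = np.empty(N, dtype=float)
    for g in (0, 1):
        n_g = np.count_nonzero(labels == g)
        weights[labels == g] = (N / 2.0) / n_g   # w_g = (N/2) * n_g^(-1)
    return weights

# Example: 30 LSA and 61 MSA assemblages, as in the database used here.
labels = np.concatenate([np.zeros(30, dtype=int), np.ones(61, dtype=int)])
w = group_weights(labels)
print(round(w[0], 3), round(w[-1], 3))              # 1.517 (LSA), 0.746 (MSA)
print(w[labels == 0].sum(), w[labels == 1].sum())   # both equal 91/2 = 45.5
```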

An additional weighting scheme is developed to account for the fact that certain groups of assemblages are identical under this typological scheme (see Table 1). Whilst this is not a problem for the basic WBLR analysis—and the use of this weighting scheme makes no difference to the results of that analysis—it is problematic for the permutation analysis that follows; for consistency, this additional weighting scheme is therefore employed throughout. Where the assemblages within a sub-group are identical, that assemblage type is entered only once into the analyses, with a weighting that reflects the number of assemblages of that type. For example, Kisese II levels 3, 7, 9, 10, and 11 are identical; this assemblage type is entered once, with a weight five times that of a single LSA assemblage. Table 1 demonstrates that the 91 assemblages in the analysis fall into 65 types (20 LSA and 45 MSA); the table also shows a breakdown of the weighting scheme for these 65 types as used in the initial analysis.

Table 1 Assemblages by group number with binary logistic regression weights
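The collapsing of identical assemblages into weighted types described above can be sketched in the same illustrative spirit; the function name collapse_to_types and its arguments are our own, and the presence/absence matrix X, the labels, and the basic weights are placeholders rather than the published database:

```python
import numpy as np

def collapse_to_types(X, labels, base_weights):
    """Collapse identical assemblages into single weighted 'types'.

    X: (N, 16) binary matrix of technologies; labels: (N,) 0/1 industrial
    labels; base_weights: (N,) weights from the basic scheme sketched above.
    A type is defined by its combination of technologies plus its label, and
    its weight is the sum of the weights of the assemblages it contains
    (e.g. five identical LSA levels receive five times the single-LSA weight).
    """
    keyed = np.column_stack([X, labels])
    types, idx = np.unique(keyed, axis=0, return_inverse=True)
    type_weights = np.bincount(idx, weights=base_weights)
    return types[:, :-1], types[:, -1].astype(int), type_weights
```

Applied to the database described above, a procedure of this kind yields the 65 weighted types (20 LSA and 45 MSA) summarized in Table 1.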

Initial inspection of the data matrix and preliminary standard logistic regression runs demonstrated that the results suffer from a phenomenon known as ‘separation’ (Albert and Anderson 1984) or ‘monotone likelihood’ (Bryson and Johnson 1981). This is an instance of sparse data bias (Greenland et al. 2016) in which, although the likelihood appears to converge, the coefficients do not; it is immediately signalled by the presence of one or more coefficients that are effectively infinite (i.e. with an absolute value limited only by the number of iterations permitted by the analyst when minimizing the negative log-likelihood). In the current dataset, separation is caused primarily by the presence of categorical covariates with either very high or very low prevalence (i.e. tool forms that exist either in most assemblages or in very few assemblages). The output of standard logistic regression models under separation is essentially meaningless.
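The phenomenon is easy to reproduce on artificial data. In the toy Python example below (our own construction, unrelated to the archaeological database), a single predictor perfectly separates the outcome, and the fitted slope is therefore essentially arbitrary, its magnitude reflecting only the optimizer's stopping rule:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data with complete separation: the predictor is 1 whenever y is 1.
x = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
X = np.column_stack([x, np.ones_like(x)])        # predictor plus constant

def neg_loglik(beta):
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    pi = np.clip(pi, 1e-12, 1.0 - 1e-12)         # guard against log(0)
    return -np.sum(y * np.log(pi) + (1.0 - y) * np.log(1.0 - pi))

fit = minimize(neg_loglik, x0=np.zeros(2), method="BFGS",
               options={"maxiter": 1000, "gtol": 1e-12})
print(fit.x)   # a very large slope: the likelihood converges, the coefficient does not
```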

The issue of separation has been widely noted by statisticians (e.g. Bryson and Johnson 1981; Albert and Anderson 1984; Lesaffre and Albert 1989; Kolassa 1997; Heinze and Schemper 2002; Greenland et al. 2016; Mansournia et al. 2018), and a number of solutions have been suggested. Most of these focus on the concept of penalized logistic regression—a form of shrinkage estimation using weakly informative priors—and many derive from the initial work of Firth (1992, 1993). Though Firth’s (1993) method has been reasonably widely adopted, it has been criticized on the basis that it artificially shrinks the constant, clouds the interpretation of coefficients and odds ratios, and fails to account for possible correlations in the prior (Gelman et al. 2008; Greenland and Mansournia 2015; Rahman and Sultana 2017). The analyses below therefore employ the log-F method proposed by Greenland and Mansournia (2015), using a weakly informative prior proportional to the log of the F-distribution. This method displays all the benefits of Firth’s method whilst minimizing bias; crucially, it does not include the constant in the calculation of the penalty term (Greenland and Mansournia 2015; Rahman and Sultana 2017; Mansournia et al. 2018).

Formally, the penalized negative log-likelihood function to be minimized in log-F regression is

$$ PL\left(\beta \right)=-L\left(\beta \right)-P\left(\beta \right) $$
(1)

where β is the vector of coefficients (including the constant as the last coefficient) and L(β) is the standard weighted log-likelihood; the corresponding negative weighted log-likelihood is

$$ -L\left(\beta \right)=-\sum \limits_i\left[{w}_i{y}_i\ln \left({\pi}_i\right)+{w}_i\left(1-{y}_i\right)\ln \left(1-{\pi}_i\right)\right] $$
(2)

where w are the weights, y are the values of the dependent variable (0 for LSA or 1 for MSA), and π are the estimates of the dependent variable produced by the model with coefficients β. P(β) is the log-F penalty term given by

$$ P\left(\beta \right)=\sum \limits_{j=1}^{n-1}\left[\frac{m{\beta}_j}{2}-m\ln \left(1+{e}^{\beta_j}\right)\right] $$
(3)

where n is the number of coefficients in the model (i.e. the length of the vector β) and m gives the degrees of freedom of the prior. Following the recommendations of Greenland and Mansournia (2015; see also Rahman and Sultana 2017), here m = 1. Note that the penalty is not applied to the nth coefficient (the constant term), as penalizing the constant can introduce exactly the form of bias for which Firth regression has been criticized (Greenland and Mansournia 2015). It is possible to carry out log-F regression via data augmentation for individual analyses (e.g. Discacciati et al. 2015); however, given the nature of the permutation tests described below, it is computationally more efficient in this case to minimize Eq. (1) directly.
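To make the estimation procedure explicit, the sketch below implements Eqs. (1)–(3) in Python via direct minimization. It is an illustrative reconstruction under the definitions given above (weighted log-likelihood, log-F(1, 1) penalty excluding the constant), not the Matlab code used for the published analyses; the function name fit_log_f_wblr and its arguments are ours.

```python
import numpy as np
from scipy.optimize import minimize

def fit_log_f_wblr(X, y, w, m=1):
    """Weighted binary logistic regression with a log-F(m, m) penalty applied
    to every coefficient except the constant (cf. Greenland and Mansournia 2015).

    X: (n_obs, n_cov) matrix of 0/1 technology indicators; y: (n_obs,) labels
    (0 = LSA, 1 = MSA); w: (n_obs,) weights. Returns the coefficient vector
    (constant last) and the model deviance (-2 x weighted log-likelihood).
    """
    Xc = np.column_stack([X, np.ones(len(y))])   # constant as the last column

    def neg_loglik(beta):                        # -L(beta), Eq. (2)
        pi = 1.0 / (1.0 + np.exp(-(Xc @ beta)))
        pi = np.clip(pi, 1e-12, 1.0 - 1e-12)
        return -np.sum(w * (y * np.log(pi) + (1.0 - y) * np.log(1.0 - pi)))

    def penalty(beta):                           # P(beta), Eq. (3); constant excluded
        b = beta[:-1]
        return np.sum(m * b / 2.0 - m * np.logaddexp(0.0, b))

    objective = lambda beta: neg_loglik(beta) - penalty(beta)   # PL(beta), Eq. (1)
    fit = minimize(objective, x0=np.zeros(Xc.shape[1]), method="BFGS")
    return fit.x, 2.0 * neg_loglik(fit.x)
```

In this sketch, a call such as coefficients, deviance = fit_log_f_wblr(X, y, w) returns the penalized coefficient estimates and the deviance used below as the overall measure of model fit.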

The most important overall measure of fit for a logistic regression model is the deviance (= −2 × log-likelihood); for the initial regression model, the log-likelihood, penalized log-likelihood, sample-size corrected Akaike’s Information Criterion (AICc; Burnham et al. 2011), and the Cox and Snell, Nagelkerke, and count pseudo-R² statistics are also reported. The Cox and Snell R² (\( {R}_{C\&S}^2 \)) is appropriate as, like the deviance, it assesses the fit of the full model relative to that of the null (intercept-only) model (Maddala 1983; Cox and Snell 1989). The \( {R}_{C\&S}^2 \), however, has a maximum attainable value of less than one; Nagelkerke’s (1991) correction (\( {R}_{\mathrm{Nag}}^2 \)) re-scales it by the null model likelihood to give it a range of possible values between zero and one. The count R² (\( {R}_{\mathrm{Co}}^2 \)) is simply the number of cases correctly classified divided by the total number of cases and is useful when assessing the classificatory ability of a model. For the initial regression model, the values of the individual coefficients and their likelihood ratio statistics are also reported; likelihood ratio statistics are preferred over the simpler Wald statistics as they are more reliable for small sample sizes (e.g. Agresti 2007:11ff.), particularly for the results of penalized logistic regression (Greenland et al. 2016). Whilst the production and assessment of regression coefficients is not the primary aim of this study, assessing the significance of the coefficients in relation to the significant predictors identified by Grove and Blinkhorn (2020) via neural network analyses provides a useful comparison of the two methods.
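The summary statistics listed above are simple functions of the full-model and null-model deviances. A brief sketch follows, in which (as an illustrative convention of ours, not a specification of the authors' scripts) n is the number of weighted assemblage types entering the model, k is the number of fitted coefficients including the constant, and n_correct is the number of types correctly classified:

```python
import numpy as np

def fit_summaries(dev_full, dev_null, k, n, n_correct):
    """Overall fit measures derived from model deviances (deviance = -2 x log-likelihood)."""
    aic = dev_full + 2 * k
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)          # small-sample correction
    r2_cs = 1.0 - np.exp((dev_full - dev_null) / n)       # Cox and Snell
    r2_nag = r2_cs / (1.0 - np.exp(-dev_null / n))        # Nagelkerke re-scaling
    r2_count = n_correct / n                              # proportion correctly classified
    return {"AICc": aicc, "R2_CS": r2_cs, "R2_Nag": r2_nag, "R2_count": r2_count}
```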

Permutation Analysis

In order to assess the validity of the LSA/MSA division, a permutation test was performed to compare this division to a random subset of other possible divisions of the data. Each permutation was carried out by randomly assigning the 65 assemblage types to two groups, performing a log-F WBLR on those two groups, and recording the resulting model deviance. Results of logistic regression can be imprecise and biased towards the larger group if the smaller group is too small; weighting goes some way towards addressing this problem, but highly imbalanced sample sizes can still lead to meaningless results. To ensure a range of sample sizes for the two groups (whilst preventing either group from becoming trivially small), the sample size of the first group was drawn from an integer-rounded probability distribution, with the size of the second group set equal to 65 minus the size of the first group. To assess the possible effects of imbalanced sample sizes, two probability distributions were used in two different permutation exercises:

  1. A triangular distribution with a minimum of 15, a maximum of 50, and a mean of 65/2

  2. A uniform distribution with a minimum of 15 and a maximum of 50

If low deviances tend to occur more frequently in imbalanced models, the second (uniform) distribution would be expected to produce a greater frequency of low-deviance results. This potential bias was further tested by examining correlations between the deviance of a model and the sample size of the smaller group; if greater sample size discrepancies between the two groups lead to lower log-F WBLR deviances, these correlations will be positive and significant.

Prior to each log-F WBLR, weights were adjusted such that the sums of weights for the two groups were equal to 91/2. Sample sizes for the two groups can therefore vary from 15 to 50, but sums of weights for the two groups remain identical at 91/2 in each permutation. The second weighting procedure described above (dividing the dataset into 65 weighted assemblage types rather than 91 individual assemblages) is particularly important to the results of the permutation test. Without this procedure, identical assemblages could be permuted into different groups, automatically increasing the deviance of the resulting log-F WBLR. Assessing model fits to types of assemblages rather than individual assemblages ensures the results regarding the integrity of the LSA/MSA split are as conservative as possible. Estimated p-values for the significance of the LSA/MSA split are given by

$$ \hat{p}=\frac{1+{\sum}_{i=1}^RI\left({d}_i\le {d}^{\ast}\right)}{1+R} $$
(4)

where R is the number of permutations of the data, \( {d}_i \) is the deviance of the log-F WBLR model fitted to the ith permutation, \( {d}^{\ast } \) is the deviance of the original log-F WBLR model, and I is an indicator function that equals 1 if \( {d}_i\le {d}^{\ast } \) and 0 otherwise (Grove and Pearson 2014). R was set to 99,999 permutations, yielding a minimum attainable p-value of 0.00001. Both the log-F WBLR and permutation procedures were written as custom scripts in Matlab R2019b (Mathworks, Natick, MA, USA) and are included as supplementary materials.
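Purely to illustrate the logic of that Matlab implementation, a Python sketch using the hypothetical fit_log_f_wblr helper from the earlier code block might proceed as follows; the group-size bounds and the 91/2 re-weighting follow the description above, and everything else is schematic:

```python
import numpy as np
# Assumes fit_log_f_wblr(X, y, w) from the earlier sketch is in scope.

def permutation_test(X_types, type_counts, dev_empirical, R=99_999,
                     size_dist="triangular", rng=None):
    """Estimate p-hat (Eq. 4) by comparing the empirical LSA/MSA deviance with
    deviances from random two-group divisions of the 65 assemblage types.

    type_counts: number of assemblages represented by each type, used to
    rebuild the weights after every random relabelling.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_types = X_types.shape[0]
    count_leq = 0
    for _ in range(R):
        # Draw the size of group one (15 to 50); the triangular option is
        # symmetric, so its mode equals the stated mean of 65/2.
        if size_dist == "triangular":
            n1 = int(round(rng.triangular(15, 65 / 2, 50)))
        else:
            n1 = int(rng.integers(15, 51))                 # uniform on 15..50
        labels = np.zeros(n_types)
        labels[rng.choice(n_types, size=n1, replace=False)] = 1.0
        # Re-weight so that each permuted group's weights sum to 91/2.
        w = type_counts.astype(float).copy()
        for g in (0.0, 1.0):
            w[labels == g] *= (91 / 2) / w[labels == g].sum()
        _, dev = fit_log_f_wblr(X_types, labels, w)
        count_leq += int(dev <= dev_empirical)
    return (1 + count_leq) / (1 + R)                       # Eq. (4)
```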

Results

Initial Analysis

The initial WBLR model had a deviance of 40.375 and was highly significant relative to the null (intercept-only) model (χ²(16, 65) = 96.471, p < .001, null deviance = 136.846). The AICc value for the full model was 87.396, relative to 138.909 for the null model. The pseudo-R² statistics were \( {R}_{C\&S}^2=.773 \), \( {R}_{\mathrm{Nag}}^2=.881 \), and \( {R}_{\mathrm{Co}}^2=.892 \); the latter implies that seven of the 65 assemblage types were misclassified. Of the seven misclassified assemblage types, six consisted of single assemblages whilst one consisted of two assemblages; thus, eight assemblages were misclassified in total, leading to an overall accuracy for individual assemblages of 83/91 = .912. The accuracy achieved is lower than the 91/92 = .989 achieved using neural networks by Grove and Blinkhorn (2020), but this is to be expected as WBLR is a less sophisticated classification model. The incorrectly classified assemblages were Mumba M III 77 and Panga ya Saidi 11 (LSA misclassified as MSA) and Lukenya Hill GvJm46, Enkapune ya Muto RBL4, Mumba L V 81, Marmonet Drift H4, Marmonet Drift H5, and Laas Geel SU 711 (all MSA misclassified as LSA). A graphical summary of the regression output is shown in Fig. 2.

Fig. 2

Assemblage frequencies plotted at binned regression scores for LSA and MSA assemblages. A regression score of less than 0.5 indicates an LSA classification via the logistic regression, with a regression score of greater than 0.5 indicating an MSA classification; as such, blue (MSA) bars with scores less than 0.5 represent MSA assemblages misclassified as LSA and red (LSA) bars with scores greater than 0.5 represent LSA assemblages misclassified as MSA

Coefficients for individual technologies and their likelihood ratio statistics are given in Table 2. Significant coefficients were found for backed pieces, bipolar technology, blade technology, Levallois flake technology, and point technology. The signs of the coefficients demonstrate that backed pieces, bipolar technology, and blade technology are associated with LSA assemblage types, whereas Levallois flake technology and point technology are associated with MSA assemblage types. These results agree with those of Grove and Blinkhorn (2020), with the exception that the latter study also suggested the presence of core tools and scrapers as predictors of MSA assemblages.

Table 2 Logistic regression coefficients for individual technologies and associated likelihood ratio statistics; * denotes significance at α = 0.05

Permutation Analysis

The primary goal of this study was to assess the validity of the division of these assemblage types into the widely adopted categories of LSA and MSA. The results of the permutation test are shown in Fig. 3. Using the triangular distribution of group sizes, six of the 99,999 permuted divisions resulted in WBLR models that returned deviance values less than or equal to that of the LSA/MSA division, yielding \( {\hat{p}}_t \) = 0.00007. Using the uniform distribution, the equivalent figure was 18 of 99,999, yielding \( {\hat{p}}_u \) = 0.00019. The division of these assemblage types into LSA and MSA is thus highly significant by traditional statistical standards, suggesting that these labels provide a valid classificatory scheme for this material. More nuanced interpretations of this result are possible, however, and are discussed in detail below.

Fig. 3

Results of permutation tests using (a) a triangular distribution and (b) a uniform distribution for group sample size. Red squares show permutations producing WBLR model deviances less than or equal to the empirical model deviance; blue squares show permutations producing WBLR model deviances greater than the empirical model deviance

Correlations between the sample size of the smallest group and WBLR model deviance were positive and significant in both cases (triangular, r(99,997) = 0.192, p < 0.001; uniform, r(99,997) = 0.294, p < 0.001), demonstrating that models with greater sample size imbalance produce lower deviances. Overall, 3.29% of permutation models in the triangular analysis and 12.94% of permutation models in the uniform analysis were more imbalanced than the empirical model. Of the permutation models demonstrating lower deviance than the empirical model, 77.78% were more imbalanced than the empirical model when using the uniform distribution, but none were more imbalanced than the empirical model when using the triangular distribution. These results suggest that 14 of the permutation models that returned lower deviances than the empirical model when using the uniform distribution may have done so simply because they were more imbalanced; overall, however, there were at least ten models (four generated by the uniform distribution and six by the triangular distribution) that were better than the empirical model and could not be explained by statistical artefacts.

Discussion

The results of the weighted binary logistic regression reported above agree substantively with those of Grove and Blinkhorn (2020) in that backed pieces, blades, and bipolar reduction are seen as indicative of LSA assemblages whilst Levallois flakes and points are seen as indicative of MSA assemblages. Grove and Blinkhorn (2020) also found core tools and scrapers to be indicative of the MSA; in the current study, both are found to be more associated with the MSA than the LSA, but not significantly so. In relation to scrapers, it is worth noting Tryon’s (2019:267) finding that end scrapers are found more often in LSA contexts, with side scrapers more prevalent during the MSA.

These results also broadly agree with the intuitions of previous researchers regarding the associations of these technologies with the respective industrial complexes. Technologies that are indicative of each industrial complex, however, also occasionally appear in the other, recalling Goodwin’s (1946) point that there is considerable overlap between them and agreeing with Tryon’s (2019) recent description of the eastern African transition as a prolonged process with varying regional trajectories. It is therefore important, as per Grove and Blinkhorn (2020), to recognize constellations of indicative technologies rather than individual tool forms when discussing the dynamics of the transition.

The analyses undertaken here aimed to assess the validity of the MSA/LSA division, but did not assess whether each individual assemblage was ‘correctly’ classified to one of these two industrial complexes; without detailed examination of each and every assemblage, the policy of adopting the designation provided by the excavators in each case is clearly the only sensible one to follow. Similarly, the analyses reported here are dependent upon the excavators’ use of terminology for identification of the different technologies. Whilst only further archaeological study can robustly re-assign assemblages to alternative industrial complexes, statistical results can be informative concerning which assemblages might be prioritized for re-examination. An experiment in which each of the eight misclassified assemblages in turn was reclassified and the models re-calculated—with appropriate changes to all weightings—led to the results shown in Table 3.

Table 3 Statistics obtained by reclassifying the assemblages misclassified by the logistic regression analysis reported above and re-calculating the model

As expected, the above experiment suggests that, were any of the eight misclassified assemblages reclassified, reductions relative to the original model deviance of 40.375 could be achieved. Most of these reductions are relatively minor, however, and it is important to note that at this scale the deviance does not necessarily correlate with the number of assemblages misclassified. Whilst reclassification of Mumba M III 77 would lead to the greatest reduction in deviance, reclassification of either Lukenya Hill GvJm46 or Mumba L V 81 would lead to the greatest improvement in the number of correct assemblage classifications. Any reclassifications could only take place, of course, after careful archaeological examination of the assemblages in question.
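In outline, and again reusing the hypothetical helpers sketched earlier (with per-type assemblage counts standing in for the full weighting bookkeeping), this reclassification experiment can be expressed as follows:

```python
import numpy as np
# Assumes fit_log_f_wblr() from the earlier sketch, the 65-type matrix X_types,
# labels (0 = LSA, 1 = MSA), and per-type assemblage counts type_counts.

def reweight(labels, type_counts, N=91):
    """Recompute weights so each group's weights sum to N/2 after a relabelling."""
    w = type_counts.astype(float).copy()
    for g in (0, 1):
        w[labels == g] *= (N / 2) / w[labels == g].sum()
    return w

def reclassification_experiment(X_types, labels, type_counts, misclassified_idx):
    """Flip the label of each misclassified type in turn and record the refitted deviance."""
    deviances = {}
    for i in misclassified_idx:
        trial = labels.copy()
        trial[i] = 1 - trial[i]                 # swap the LSA/MSA designation
        w = reweight(trial, type_counts)
        _, deviances[i] = fit_log_f_wblr(X_types, trial, w)
    return deviances                            # compare against the original 40.375
```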

There are numerous cultural, stratigraphic, taphonomic, chronological, and methodological factors that might either prompt re-investigation or suggest why a given assemblage is not fully indicative of the industrial complex to which it is attributed. To take Mumba as an example, Mehlman’s 1977 excavations (Mehlman 1979, 1989) were intended to address issues with the original excavations by Kohl-Larsen (1943). Nonetheless, he was only able to retrieve relatively limited samples (Mehlman 1989:78), and many of these remain unstudied (Prendergast et al. 2007). Subsequent studies have focused on the transitional nature of Mehlman’s (1989) Mumba Industry, located primarily in the Bed V horizons of the site, and on more comprehensive dating of the deposits so as to recognize patterns of change and innovation (Mabulla 2007; Prendergast et al. 2007; Diez-Martin et al. 2009; Gliganic et al. 2012; Bushozi et al. 2020). If the Middle Bed III samples recovered by Mehlman in 1977 are MSA, they would be stratigraphically and chronologically anomalous, particularly given the results of Gliganic and colleagues (Gliganic et al. 2012; see also Diez-Martin et al. 2009; Eren et al. 2013) who argue for a relatively early LSA associated with abundant ostrich eggshell beads beginning in Upper Bed V at 49.1 ± 4.3 ka. A realistic explanation for the effect of the Mumba M III 77 assemblage on the above analyses, therefore, is that it is a relatively small assemblage that is not fully indicative of its LSA provenance.

The misclassifications of some other assemblages, such as Lukenya Hill GvJm46 and Enkapune ya Muto RBL4, may be due to the fact that previously published inventories rely on partial samples from selected test pits, from depositional contexts that are not clearly established, or from sparse occupations that may not have resulted in extensive or indicative lithic accumulations (e.g. Miller & Willoughby 2014; Kelly 1996:271; Ambrose 1998:384). Ideally, future analyses would consider individually the various processes that act in combination to generate archaeological samples; in practice this is rarely possible, but a valid (albeit post hoc) alternative would be to subject those assemblages that have been misclassified in the above analyses to further investigation in relation to such processes.

As highlighted above, these analyses and their results depend upon the collation of data previously published by numerous researchers. The database therefore inevitably encompasses differences not only in the terminology used to describe individual lithics but also in the techniques employed in excavation and recording. Excavation by context, for example—where ‘context’ is defined as a homogeneous unit of the matrix, regardless of its vertical or horizontal extent—leads to a different concept of ‘assemblage’ than does excavation by regular, arbitrary spits. Ideally, an assemblage—however defined in terms of the sedimentary matrix—would equate to a discrete occupation horizon, but of course this is rarely the case. The amalgamation of synonyms into a broad typology of 16 technologies largely removes concerns about the inconsistent use of terminology for individual lithics, but inconsistencies in the delineation of assemblages remain. Such inconsistencies are unavoidable in a study of this kind—after all, the material cannot be excavated again—and in the current study, they do not appear to introduce any systematic bias in terms of the results. For example, assemblages defined by arbitrary spits (or groups thereof) are no more likely to be misclassified than those defined by archaeological contexts. This issue does, however, starkly reveal the fact that the problems facing archaeological taxonomies act at multiple scales.

The taxonomy of individual lithics has been criticized on the basis that it discretizes the continuous variation produced either by a reduction continuum or by spatio-temporal variation in the cultural production of functionally equivalent tools (e.g. Davidson and Noble 1993; Davidson 2002). The process of dividing excavated material into assemblages—except in those rare cases where such assemblages are bracketed by sterile layers—is a second process by which continuous variation is discretized. Finally, assemblages are categorized by technocomplex, which further masks the continuity between them. Most archaeological analyses, therefore, depend on various, cumulative methods of discretization; comparisons between periods or between regions cannot be accomplished without the application of such methods. The resulting analyses are often genuinely valuable, but archaeologists must also remain cognizant of the limitations the underlying methods impose.

The results of the permutation analyses reported above suggest that the LSA/MSA division is valid based on a standard statistical criterion (i.e. α = 0.05, 0.01, or even 0.001), but that it is not the single best division of the data; thus, the weak form of the hypothesis is supported, but the strong form is not. The history of archaeology as a largely descriptive discipline, with quantitative hypothesis testing emerging as a significant component long after the establishment of our cultural taxonomies, leads to a situation in which statistical analyses are being used as post hoc tests of those taxonomies (see also Ivanovaite et al. 2020). On the one hand, this is regrettable, but on the other, it is important to stress that the meaning—and therefore the usefulness—of our taxonomies must emerge from archaeological rather than from statistical reasoning. Ideally the two would be complementary, but the complexity and paucity of the archaeological record often stifle this alliance.

As an example of why archaeological reasoning must take precedence in such cases, Fig. 4 shows a (purely theoretical) series of assemblages plotted in two dimensions; these dimensions could be counts of two tool forms, or more realistically the first two axes of a principal components analysis. The two dashed lines in the figure are both examples of complete separation in this two-dimensional space; that is, a binary logistic regression or linear discriminant analysis could perfectly separate the data into two groups to either side of either of these lines based purely on the two axes shown (vertical or horizontal divisions would only need one axis to do so). Yet, there are any number of additional lines that could also achieve such separation; all would be statistically equivalent, but would any be archaeologically meaningful?

Fig. 4

Theoretical plot of the first two axes of a principal components analysis on a group of archaeological assemblages. Dashed lines represent two possible examples of complete separation

The above is an example of an unsupervised analysis, in which patterns are sought in the data without prior labelling; a complete overhaul of archaeological cultural taxonomy would necessarily be built on the results of such analyses. The WBLR reported above is a supervised analysis, in which coefficients are sought that divide the data as well as possible into categories to which they have been assigned a priori. The permutation analyses undertaken here exist in the space between supervised and unsupervised analyses, as each permutation returns the result of a supervised analysis in which membership of the a priori categories is assigned at random. Archaeology is currently in a position whereby revision of existing cultural taxonomies is likely to be more beneficial than building those taxonomies anew from the ground up; as Reynolds and Riede (2019b:1369) state, ‘there is structure in the archaeological record, and abandoning taxonomies altogether would limit… the types of questions that we can ask’. In this context, further exploration of the space between supervised and unsupervised analyses is likely to prove useful.

The problems of cultural taxonomy discussed here are certainly not limited to archaeological endeavours. For example, some British architectural styles correspond broadly to chronological periods, but these styles frequently overlap, and their specific durations are disputed, even when their labels derive from correspondence to the reigns of particular monarchs (e.g. Victorian, Edwardian). The differences between Victorian and Edwardian domestic architecture (fewer storeys, higher ceilings, broader hallways behind wood-framed porches in the latter) are fewer than their similarities; as such, much like lithic industries or biological species, they grade into one another when viewed from a sufficient chronological distance. With the exception of architectural styles that consciously derive inspiration from previous periods (e.g. Neoclassical), labels are applied post hoc in much the same way that they are in archaeological systematics. ‘Edwardian’ architects did not set out to create a distinctly ‘Edwardian’ style as a counterpoint to the previous ‘Victorian’ style; instead, differences can only be discerned in hindsight by scholars working in later periods. Nonetheless, these labels act as useful heuristic devices for the discussion of changing architectural styles through time, serving much the same purpose as our archaeological nomenclature. If the labels did not exist, the discussion could not proceed, and this would be detrimental not only to systematics itself but also to broader understanding. Attempts to simply abandon existing cultural taxonomies—in archaeology as in any other discipline—are therefore entirely without value; attempts to revise existing taxonomies must be grounded in first-hand re-examination and logical assessment of affinities between large numbers of assemblages (see Shea 2020 for a recent example). Current archaeological taxonomy may resemble a ‘house of cards’ (Reynolds and Riede 2019a), but it would be premature to pull this house down before a new one has been built.

The analyses carried out above examine assemblages at the scale of industrial complexes and do so by recording the presence or absence of 16 technologies within each assemblage. This is a relatively common approach to the African Stone Age record (e.g. Tryon and Faith 2013; Blinkhorn and Grove 2018; Grove and Blinkhorn 2020; Shea 2020) but clearly operates at a very different scale to analyses of, for example, metric attributes of individual tool forms (e.g. O'Brien et al. 2014; Ivanovaite et al. 2020). Different questions demand different scales of analysis, and it is often the case that analyses at finer scales can only proceed by deliberately ignoring patterning at coarser scales. To employ a biological example, traits that distinguish genus one from genus two are unlikely to be useful in distinguishing between two species that both belong to genus two because the attribution of those species to genus two necessarily implies that they both display those traits. These shared or ‘primitive’ traits are of no use in pursuing the finer-scale division between species. In much the same way, it may be feasible to support broad scale archaeological cultural taxonomies (e.g. MSA, LSA) whilst simultaneously questioning their subdivisions (e.g. Nubian; see Groucutt 2020).

Perhaps the most substantive problem with existing archaeological cultural taxonomy stems from the way in which it is interpreted and used. This stems from a lasting culture-historical legacy that equates particular groups of artefacts or artefact types with particular groups of people; the ‘culture’ of a people is explicitly manifest in the material culture assemblages those people produce, and the assemblages therefore indicate the people. In this regard, Kleindienst (2006:17) argues that the Burg Wartenstein recommendations were ‘fatally flawed’ because ‘those in favour of such a system could not persuade their colleagues… to leave the “group of prehistoric people” out of the definition of the “Basic Unit”’ (i.e. the ‘Industry’ as defined in Bishop and Clark (1967:893)). The idea that ‘cultures’ in this sense are immutable and inextricably linked to groups of people permits migration and diffusion but ignores both the ability of hominin actors to flexibly respond to changing circumstances and the possibility of convergent responses of different temporally or geographically distant groups to similar circumstances.

A stark alternative to the ‘group of prehistoric people’ perspective sees the production of material culture primarily as a functional reaction to ecological circumstances and provides markedly different interpretations of the same datasets (e.g. Bordes 1961; Binford and Binford 1966; Bordes and De Sonneville-Bordes 1970; Binford 1973). If a recurring assemblage of archaeological material is the physical manifestation of a distinct set of ideas belonging to a distinct group of people, then archaeological analyses tell us about the spatio-temporal history of that group of people; but if the same recurring assemblage represents just a subset of a group’s material repertoire, and if different groups employ similar subsets when encountering similar circumstances, then analyses tell us more about the circumstances these groups encountered than about their social norms or cultural values. If one adopts the latter position, a further difficulty arises in the need to disaggregate those aspects of the assemblage (and of individual artefact form) that serve a direct subsistence function from those that serve a social, symbolic, or otherwise cultural function (e.g. Dunnell 1978; Brantingham 2007). Any taxonomy—cultural or otherwise—is constructed in reference to a particular analytical goal, with results interpreted in relation to a particular theoretical position.

Conclusions

The analyses presented above sought to test the integrity of the cultural taxonomic division between MSA and LSA assemblages in eastern Africa by comparing that division to a large sample of arbitrary divisions of the same data. Results suggest that the division is valid on the basis of any routinely employed statistical criterion, but that it is not the single best division of the data. These results invite questions about what archaeologists seek to achieve via cultural taxonomy and about the analytical methods that should be employed when attempting to revise existing nomenclature. Quantitative analyses are necessarily more robust than their purely descriptive counterparts but will only prove truly useful if their results can be interpreted in archaeologically meaningful ways. Archaeologists seek information about similarities and differences that characterize assemblages that originated in different periods and regions or that were produced under different environmental regimes. Such similarities and differences—where they occur—are often highly complex, existing at different scales and along multiple axes of variation. The sheer variety of hominin behaviour precludes simple classification, but classification is essential to discussion. Archaeological cultural taxonomies are largely heuristic devices, but they remain valuable, and—at least in the case of the eastern African MSA and LSA—they map onto important differences in the stone tool assemblages created by our ancestors.