Different hunting and herding strategies target specific classes of animals among a herd that is determined by the animal’s ontogenetic age and sex (Dahl & Hjort, 1976; Stiner, 1990). In addition to human-driven goals, sex differences in habitat use, diet quality, and reproductive capabilities among ungulate prey species contribute to the susceptibility and desirability of males and females at different ages to human exploitation (Corti & Shackleton, 2002; Post et al., 2001; Ruckstuhl, 2007; Ruckstuhl & Neuhaus, 2002; Saïd et al., 2011). These factors impact the formation of bone assemblages by affecting the probabilities that bones from different classes of animals (e.g., immature, adult female, or adult male animals) are deposited before being mediated by other taphonomic processes (Lyman, 2008). The ontogenetic age and sex composition of zooarchaeological assemblages can, therefore, reflect anthropologically-relevant aspects of past hunting strategies—like seasonal site use and scale of exploitation (Speth, 2013)—or general management goals of past herding strategies (e.g., Payne, 1973; Redding, 1984).

Reconstructing the ontogenetic age and sex composition of a zooarchaeological assemblage can enrich our understanding of past human-animal interactions by complementing mortality profiles and inter-assemblage comparisons. However, the disaggregated nature of faunal assemblages complicates efforts to conclusively identify the ontogenetic age and sex of a specimen. Because articulated remains are rare, zooarchaeologists typically cannot relate elements that are morphologically distinct between the sexes (e.g., the pelvis) to other elements that can provide information about the animal’s age-at-death (e.g., limb bones or mandibles). We can, though, take advantage of the general pattern of sexual dimorphism among ungulate taxa by using size differences in limb bones to distinguish between males and females.

Morphometric Sex Determination in Zooarchaeology

Some biometric methods to determine the sex of an animal bone are multivariate—using combinations of measurements with bivariate plots or discriminant functions to predict the sex of archaeological specimens based on distributions of known-sex specimens (e.g., Munro et al., 2011; Speth, 1983; Widga, 2006). These methods typically combine dimensions from different planes of an element (e.g., the breadth and depth of a distal articular end) to produce patterns that can be separated by a “cut point” between males and females, either visually in the case of bivariate plots or algorithmically in the case of discriminant functions. The analytical requirement that multiple dimensions of a bone be preserved in measurable condition, even on the same end of an element, may make it difficult to apply these methods to more heavily processed assemblages. Furthermore, specimens from animals that died before reaching adult body size may be misclassified as females, particularly for dimensions affected by post-fusion growth (Popkin et al., 2012).

Other sex determination methods are univariate—they use a single measurement from a specimen and typically use size index methods to associate those measurements from different elements together (e.g., Weinstock, 2006; Zeder & Lemoine, 2020). This approach allows general descriptions of the sex ratio in an assemblage that can be used to identify changes in these sex ratios or overall biometry over time (e.g., Arbuckle & Atici, 2013; Grigson, 1989). Zeder and Lemoine (2020) go further by using inter-quartile ranges of log size index (LSI) values from their reference population to create ‘cut-off’ values between immature, female, and male specimens to calculate specific ontogenetic age and sex ratios for elements and assemblages.

Regardless of whether the method uses multivariate or univariate data, these sex determination methods tend to have the same weaknesses. Practically, these methods rely on direct comparisons with reference populations (typically, but not always, modern populations of known sex). Thus, the analysis implies that the biometry of the archaeological population is the same as the reference population. However, this implication is an untenable one in most cases, as animal biometry typically varies spatially and temporally due to population-level intra-taxonomic genetic differences caused by adaptation to local climates and ecologies (e.g., Davis, 1982; Hill et al., 2008; Koch, 1986; Lebenzon & Munro, 2022; Wright & Viner-Daniels, 2015). Biometric variation in wild and domesticated taxa has also been attributed to anthropogenic pressures as a result of herding decisions or hunting pressure (e.g., Arbuckle & Kassebaum, 2021; Grau-Sologestoa & Albarella, 2019; Manning et al., 2015; Trentacoste et al., 2021), though harvest pressure has also been attributed to biometric changes in wild taxa (e.g., Wolverton, 2008; Munro et al., 2022. These environmental and anthropogenic pressures may affect males and females differently (e.g., Tchernov & Horwitz, 1991; Zohary et al., 1998); pressures that reduce sexual dimorphism could interfere with analyses, as more specimens may be indeterminate or misclassified. Biometric variation between populations can complicate efforts to estimate changes in the demographic (ontogenetic age and sex) composition of assemblages over time. Furthermore, efforts to control for ontogenetic age (e.g., removing unfused specimens or those from early-fusing elements) distorts the relationship between the analyzed specimens and the rest of the assemblage, decreasing our ability to make reliable inferences about the entire assemblage (Zeder & Hesse, 2000).

Philosophically, sex determinations made by these methods tend to be absolutist; specimens are identified as male or female (or immature) or are marked as indeterminate. As in taxonomic identifications, the use of absolutist determinations masks any underlying uncertainty in the determination (Wolfhagen & Price, 2017). Removing indeterminate specimens from consideration artificially reduces sample sizes and inflates reported accuracy rates. This produces a false sense of confidence in the sex determination results, especially when those results are then used to characterize the entire assemblage. More critically, any nuances or caveats in the sex determinations of an assemblage are lost when the results are used in synthetic analyses at larger spatial and temporal scales. What is necessary is a way to estimate the ontogenetic age and sex composition of a faunal assemblage that preserves the uncertainty inherent in the process.

Mixture Modeling in Zooarchaeology

Mixture modeling provides just such a method, producing probabilistic sex identifications rather than absolute ones by describing an assemblage of faunal measurements as a mixture of specimens from different animal groups (generally termed “mixture components”) like male and female specimens—described by parameters for the proportion of the overall assemblage (𝜋), average size (𝜇), and size variability (𝜎) of each animal group. A mixture model allows researchers to not only describe the overall composition of the assemblage but to also estimate the probabilities that a specific specimen belongs to a particular animal group (Dong, 1997; Monchot & Léchelle, 2002). Additionally, mixture modeling does not rely on a reference population, allowing biometric variation between populations and even changes in the extent of sexual dimorphism (e.g., Helmer et al., 2005). These features allow mixture models the flexibility to track both biometric and demographic variation across assemblages over time and space (e.g., Arbuckle et al., 2016; Arbuckle & Kassebaum, 2021).

Conceptually, a mixture model can be thought of as a “latent state” or “missing data” problem; we know that measured specimens come from particular animal groups, but that information has been lost (Marin et al., 2005). If we knew every specimen’s group identity, then the calculation of the mixture model parameters (mixture proportion, average body size, and size variability for each animal group) would be trivial. In archaeological contexts, however, we cannot directly observe those group identities; we must, therefore, use probabilities of group membership and calculate group-specific parameters from those resulting probabilities (Monchot & Léchelle, 2002). Figure 1 describes a schematic example of a mixture model; Fig. 1A shows the distribution of LSI values from a reference population of 31 adult pig (Sus domesticus) tibia distal breadths (Tibia Bd: von den Driesch, 1976) described in Zeder and Lemoine (2020), with specimens colored by their known identity (females = blue, males = red). Figure 1B shows the results of fitting a two-component (females and males) mixture model to the data using standard approaches (e.g., Arbuckle & Kassebaum, 2021; Monchot & Léchelle, 2002), ignoring those true identities.

Fig. 1
figure 1

Walkthrough of the mixture modeling procedure using pig Tibia Bd measurements from Zeder and Lemoine (2020). A The distribution of Tibia Bd LSIe values (standard value: 33.5 mm; Hongo & Meadow, 2000) of adult specimens, shaded by known sex (females = blue, males = red). B The result of a standard (maximum-likelihood estimation) mixture model analysis on the LSIe values, ignoring sex. Vertical lines show the estimated means for females and males; curves show the relative probability densities for the two distributions. The point shows the probability that a specimen with a particular LSIe value (0.06) would be considered female by the model results. C Four simulations using the membership probabilities of each specimen based on the mixture model results

The mixture model describes the assemblage as a mixture of the two “mixture components” (males and females); each component is described with three parameters: a proportion (𝜋), an average size (𝜇), and a standard deviation (𝜎). Taken together, these parameters determine a specimen’s probability of being in one of the groups, as shown in Fig. 1B; a specimen with an LSI𝑒 value of 0.06 has a 31% probability of being female based on the model. This results in 31 sets of probabilities, one for each specimen; Fig. 1C shows four plausible simulated assemblages that result from the mixture model; every specimen’s membership probability is used to simulate a “true” identity. Importantly, by leaving the mixture model results as specimen-specific probabilities of being female or male, mixture model results retain the uncertainty of the sex determination process; that is, a specimen with a 51% probability of being female is not treated as equivalent to one with a 95% probability of being female.

The previous example showcases the benefits of mixture modeling as a flexible probabilistic sex determination method. First, the model does not require comparison to a reference population to estimate differences between female and male specimens. Second, the model produces parametric estimates of body size and size variability that can be used for inter-site comparisons, rather than just determinations for the included specimens. Finally, the model produces probabilistic estimates for every specimen, rather than leaving some specimens indeterminate and obscuring variation in the confidence of the sex assignments. These theoretical and practical advantages of mixture modeling and its potential for zooarchaeology have been apparent since its introduction to the field (e.g., Dong, 1997; Monchot et al., 2005; Monchot & Léchelle, 2002), though its application has been piecemeal over the past two decades despite the existence of free scientific software that can perform the analysis (e.g., PAST: Hammer, 2022; R packages “mixtools” and “mclust”: Benaglia et al., 2009; Scrucca et al., 2016).

The reasons for the patchy application of mixture modeling in zooarchaeology are less straightforward. High-profile early case studies of mixture modeling report size variability parameters (standard deviation 𝜎) that vary widely and include very small values for some groups (e.g., De Cupere et al., 2005; Monchot et al., 2005; Vigne, 2011). These results suggest that the very flexibility, that is, a great strength of mixture modeling, is actually identifying “groups” that are not necessarily consistent with biological expectations (e.g., that the results are “overfitted” to the observed data). Such extreme differences in the standard deviation of different groups can result in counterintuitive implications; specimens may be considered more likely to come from the broad distribution (the one with the larger 𝜎 parameter) than the narrow distribution even when the value is more extreme than the narrow distribution’s mean (e.g., is larger than a larger mean or smaller than a smaller mean). Returning to the mixture model example in Fig. 1 can explain this issue more clearly. Table 1 shows the mixture model parameters for the two components; the standard deviation (𝜎) for females is more than twice the standard deviation for males (𝜎1 = 0.034, 𝜎2 = 0.016). As such, higher numbers beyond the observed range will be considered likely females; an LSI𝑒 value of 0.173 (Tibia Bd value: 39.83 mm) is more likely to be a female than a male using the mixture model’s results (probability of being female: 51%).

Table 1 Mixture model parameter estimates for the pig distal tibia Bd example

Published mixture model examples show this issue, as well. De Cupere et al. (2005), Table 2) report three groups of chicken carpometacarpus lengths from bones with medullary bone, providing the full set of mixture model parameters (group 1: proportion 𝜋 = 0.285, mean 𝜇 = 33.337, standard deviation 𝜎 = 0.3; group 2: 𝜋 = 0.608, 𝜇 = 35.416, 𝜎 = 0.433; group 3: 𝜋 = 0.107, 𝜇 = 37.866, 𝜎 = 0.094). According to De Cupere et al. (2005), fig. 3), there is one carpometacarpus with a medullary bone whose greatest length is roughly 41.5 mm. Counterintuitively, the analysis would suggest that this specimen is most likely to be a member of group 2; it even determines that the specimen is more likely to be a member of group 1 than group 3. Vigne (2011), Table 3) reports mixture modeling results of LSI10 values from cattle recovered from Neolithic Shillourokambos, Cyprus, to estimate females and males, using PAST (Hammer, 2022). The reported values for the Recentes phase (female 𝜋 = 0.75, 𝜇 = 0.120, 𝜎 = 0.042; male 𝜋 = 0.25, 𝜇 = 0.163, 𝜎 = 0.007) produce counterintuitive results; a specimen with an LSI value of 0.176—within the range of LSI values from this phase (Vigne, 2011: fig. 2)—would be considered more likely to be female than male. These issues extend to more recent publications. Arbuckle et al. (2016), fig. 5) report sex-specific LSI10 average sizes for cattle in the eastern fertile crescent during the early-mid Holocene; because they report their LSI10 data in a supplement, it can be shown that the smallest measurement from Ganj Dareh (LSI10 = −0.044, modeled female mean = −0.019, modeled male mean = 0.024) is considered more likely to be male than female due to the extreme differences in standard deviations.

Table 2 Definitions of key terms used in this paper
Table 3 Software packages used in the analytical script for this paper
Fig. 2
figure 2

Examples of prior distributions for demographic parameters (proportion of immature animals or adult sex ratio) with different expectations. A A distribution where extreme values are considered unlikely but otherwise most values are about equally as likely. B A distribution where extremely high or extremely low values are likely but indeterminate values are much less likely. C A distribution where it is extremely likely that values are centered around 67%. The last scenario would only be appropriate if there is sufficient prior knowledge about the context.

These examples highlight the difficulties researchers face when interpreting the results of mixture analyses of zooarchaeological data. While mixture modeling provides the flexibility to model data from a pre-specified or unknown number of groups, there is no guarantee that the identified “groups” are biologically meaningful. Analysts may identify inconsistent results from mixture analyses and exclude the analysis from reports, leaving only mixture analyses that appear to have interpretable results (the “file drawer problem”: Rosenthal, 1979). As these examples show, however, mixture analyses applied to more abstract quantities, like LSI values, or interpreted in light of less easily interpreted biological groups, like breeds, can have counterintuitive implications. These examples are not meant to highlight the errors; on the contrary, the fact that the authors report their full model results and/or data mean that such errors could be identified, highlighting the importance of open scientific reporting and publishing (Marwick, 2017; Ram & Marwick, 2018).

Zooarchaeologists have a wealth of reference information that can inform them about the impacts of diet, sex, castration, and other factors on the size and variability of animal bones. These reference populations provide raw measurements from several taxa and generally include specimens of known age-at-death and sex, though sometimes these include archaeological data of (relatively) complete individuals that can be assigned to sex (e.g., sheep: Popkin et al., 2012; Davis, 1996, 2000; pigs: Zeder & Lemoine, 2020; Payne & Bull, 1988; aurochsen/cattle: Degerbøl, 1970; bison: Speth, 1983; Todd, 1983). These data can provide useful information that could be relevant for interpreting a mixture model analysis; ideally, one could take advantage of relevant information from reference populations while still maintaining some aspects of a mixture model’s flexibility. Unfortunately, standard mixture modeling algorithms do not provide a straightforward way to ensure that the model parameters (𝜇 and 𝜎) for the groups accord with our understanding of these parameters from reference populations. Bayesian inference, however, does provide a way to do this very thing by using data from reference populations to create prior distributions for mixture model parameters. Using prior distributions improves overall model performance because the analyst can use these sources of ‘prior knowledge’ to inform them about the data that they have on-hand (Otárola-Castillo et al., 2022).

This paper describes a Bayesian approach to the mixture model analysis of faunal measurements that addresses these weaknesses of mixture modeling as currently applied. The model uses informative priors derived from a “prior assemblage” of known age-at-death and sex individuals to constrain population parameter estimates to be biologically interpretable (Popkin et al., 2012). It also uses multilevel modeling to take advantage of partial pooling and address aggregation issues to directly estimate parameters for each measured dimension in the analysis (Gelman, 2006a; Wolfhagen, 2020). In addition to modeling females and males, the model includes a third group consisting of “immature” specimens that died before reaching adult body size. The model also emphasizes inference of the entire assemblage rather than just the measured specimens by incorporating observations of the sex ratio (from morphological data) and the proportion of immature specimens (from fusion data) to inform population parameters of the proportions of these different groups. The model is used on sixteen simulated assemblages derived from the (Popkin et al., 2012) Shetland sheep (Ovis aries) population to test its ability to accurately estimate the age and sex composition of assemblages. Two archaeological case studies then show the applicability of the model to archaeological assemblages for reconstructing the age and sex composition of assemblages and to highlight the importance of incorporating immature specimens into mixture modeling analyses.

A Bayesian Multilevel Mixture Model for Zooarchaeological Measurements

The Bayesian model developed for this paper improves on standard mixture modeling for zooarchaeological measurements in four distinct ways. First, it addresses complications caused by measurements from unfused specimens and post-fusion growth by modeling three groups within the mixture: immature animals, (adult-sized) females, and (adult-sized) males, each with distinct size parameters. Second, the multilevel structure allows the model to balance bias due to aggregation and overfitting from small sample sizes. Third, the Bayesian foundation of the model provides an avenue for synthesizing information about the ontogenetic age and sex composition of the assemblage from non-metrical data (e.g., fusion rates, sex ratios based on morphological data) to inform the results of the mixture model. Finally, researchers can create prior distributions for mixture model parameters from prior assemblages or other sources, ensuring that the mixture model results are biologically interpretable. This section outlines these benefits; specific details of the model are described in a Model Supplement and in the analytical code (available at the project’s GitHub page:

Observed measurements from different dimensions (e.g., humerus distal breadth “Humerus Bd,” radius proximal breadth “Radius Bp,” abbreviations following von den Driesch 1976) are first converted to logarithmic size index (LSI) values using a natural logarithm base to take advantage of the normalization LSI standardization provides (Meadow, 1999; Wolfhagen, 2020). The model uses a single LSI value per specimen, so any specimens with multiple observed dimensions are first summarized by estimating the mean LSI value from the observed dimensions; these specimen-level LSI values are the basis for the mixture model. These specimen-level LSI values can be clustered into different element portions—partial or complete elements that are the basic categorical unit of an analysis (e.g., “distal humerus” or “first phalanx”; compare to “skeletal part type” in Breslawski 2023). Table 2 provides a glossary of the key terms used in this text.

Benefits of the Bayesian Multilevel Mixture Model

Body size is affected by both ontogenetic age and sex; animals killed before reaching adult body size pose a complication for most sex determination models, which exclusively focus on distinguishing between (adult-size) female and male animals (but see Zeder & Lemoine, 2020). Measurements from known age-at-death Shetland sheep show that specimens killed younger than 1 year of age are significantly smaller than those killed at older ages, regardless of fusion status and sex; after 1 year of age, size is no longer significantly impacted by age (Popkin et al., 2012). Thus, any measurement from an unfused epiphysis or from an element portion that does not fuse or exhibits significant post-fusion growth should be considered potentially immature and needs to be modeled with a three-member mixture model (group 1 = immature, group 2 = adult female, and group 3 = adult males). On the other hand, the model excludes the possibility that measurements from specimens that are conclusively not immature due to their fusion status could be from the immature group (𝜋1 = 0), effectively fitting a two-member mixture model (adult females and adult males).

Typically, biometric analyses aggregate LSI values from different element portions (e.g., Arbuckle & Kassebaum, 2021; Sasson & Arter, 2020; Vigne, 2011); aggregation produces bias because it assumes that every element portion has the same parameter value (Wolfhagen, 2020). Multilevel modeling uses partial pooling to allow the cluster-specific parameters to vary between clusters while reducing overfitting caused by small sample sizes (Fernée & Trimmis, 2021; McElreath, 2020). In the case of this mixture model, element portions are the relevant clusters—the multilevel model produces a set of mixture model parameters for each element portion (a set of 𝜋, 𝜇, and 𝜎 parameters for each of the three animal groups). These cluster-specific parameters are related to each other through “hyper-parameters” that describe the average value of the mixture model parameters and the variability of model parameters across element portions (Wolfhagen, 2020). This structure reduces overfitting caused by small sample sizes among some clusters while also avoiding the bias caused by aggregating all clusters together.

Of course, biometric data are not the only source of information on an assemblage’s ontogenetic age and sex composition. Fusion rates of elements that fuse around the age that animals reach adult body size can provide relevant information on the proportion of immature specimens in the assemblage (e.g., first and second phalanges in sheep: Popkin et al., 2012), just as sex ratios derived from morphologically distinct adult elements provide information about the adult sex ratio in an assemblage (e.g., fully fused pelvises: Stiner et al., 2022; horn cores: Twiss & Russell, 2009). These estimates of assemblage composition do not supersede those produced by a mixture model, but they are also not irrelevant to the composition from a mixture model. Unlike other sex determination methods, the multilevel structure of the Bayesian multilevel mixture model allows the analyst to inform their model results with relevant fusion and morphological sex data from the assemblage. These data do not determine the proportion of immature animals and the adult sex ratio of the mixture model, but they do help the model make more precise estimates of the ontogenetic age and sex composition of the assemblage than possible with the measurement data alone.

Relevant information from a prior distribution can inform an analyst about reasonable values for the model’s hyper-parameters, which can be summarized as prior distributions. Creating informed prior distributions allows the model to ensure that the hyper-parameters have biologically interpretable results (e.g., average size and size variability parameters for a measurement that align with reasonable expectations for a taxon). The multilevel structure of the model then ensures that mixture model parameters can vary between different element portions while still being informed by these hyper-parameters to maintain biological interpretability, even with small numbers of observations. Prior distributions draw explicit links between our sources of prior knowledge (e.g., reference populations, ethnographic data, ecological data) and our archaeological data. Unlike absolutist models, we can define the prior distributions used in the Bayesian multilevel mixture model to be less specifically focused on the parameter values of the prior assemblage. Increasing the uncertainty of assemblage-derived prior distributions allows the mixture model to adjust to biometric differences between the prior assemblage and the assemblage being fit by the model. Care must still be taken to ensure that prior distribution definitions are at appropriate scales for the observations and not so broad as to include values that are known to be physically impossible (e.g., Gabry et al., 2019).

Developing Prior Distributions from a Prior Assemblage

Prior distributions are central to Bayesian inference and describe one’s prior beliefs in the potential values of a model parameter. Prior distributions can be likened to a “filter” from which parameter values are drawn to evaluate their fit with the data (Smith & Gelfand, 1992). Several approaches exist for deciding how to describe this prior belief, ranging from “objective” priors that provide equal weight to all possible values of a parameter to distinct distributions defined by a synthesis of previous or related research (Gelman, 2006b). Objective priors poorly reflect our intuition about phenomena we are modeling, waste computing effort by sampling parameter values that poorly fit the data, and can introduce errors into our analyses (Gabry et al., 2019); instead, “weakly informative priors” or “reference priors” use transformations of parameter values—like centering and scaling element portion-specific parameters—to describe variation in parameter values within reasonable values, with small deviations being more likely than large deviations (Gelman et al., 2008). Informative priors are derived from relevant knowledge, be it the results of earlier studies on the same subject, the quantification of expert opinion, or parameter values for related subjects (McCarthy & Masters, 2005; Otárola-Castillo et al., 2022). Regardless of the distribution’s source, it is important to evaluate how well the distribution reflects your prior knowledge about the system under study because the prior distributions influence the results of the analysis.

The mixture proportions summarize the composition of the assemblage and mediate the relative likelihoods of the different animal groups, adjusting a specimen’s membership probabilities. Prior distributions for the mixture components reflect our prior beliefs about the relative proportions of immature, adult-sized females, and adult-sized males in the assemblage. Instead of estimating the prior belief for each of these three related categories, the model uses two prior distributions to estimate independent variables: the proportion of immature animals (𝜋1) and the adult sex ratio estimated through the relative proportion of adult females (\(\frac{\pi_2}{\pi_2+{\pi}_3}\)) (see Model Supplement for more details). The following examples show some of the flexibility researchers have when describing their prior belief about the proportion of immature animals or the adult sex ratio in an assemblage; the Model Supplement shows the mathematical details necessary to create relevant prior distributions for a model.

Figure 2 shows three examples of prior distributions for one of these mixture proportion concepts (proportion of immature or adult sex ratio) that reflect different expectations based on prior knowledge. Note that the Bayesian mixture model uses the observations of fusion rates and morphological sex ratios as observed data, so these prior distributions reflect knowledge prior to even those observations. Figure 2A shows a relatively broad, or uncertain, prior distribution where a researcher does not believe that the proportion is extreme (i.e., that the assemblage is either dominated by or bereft of immature animals or that the sex ratio is dominated by females or males) but has little opinion otherwise. Figure 2B shows a somewhat inverse situation, where the researcher is confident that the proportion (either the proportion of immature animals or the adult sex ratio) is at either one extreme or the other but is not sure which extreme it is. Figure 2C displays a scenario where the researcher is confident that the proportion is centered around 67% before looking at the faunal data, presumably based on prior research or other contextual information.

Prior distributions for parameters governing average body size (𝜇) and size variability (𝜎) are the keys to ensuring that the mixture model produces biologically feasible and interpretable results. These prior distributions are based on analysis of a “prior assemblage” created by sampling immature, adult female, and adult male/castrate Shetland sheep from the Popkin et al., (2012) population (150 specimens for each animal group; see Model Supplement for more details). Castrates were considered males for the purposes of the model, as proximal and distal bone width measurements as a whole did not vary significantly between intact males and castrates (Popkin et al., 2012: 1783–1784). Modeling the average body size and size variability of these animal groups using a multilevel model created the starting point for prior distributions that could be used in the model. To generalize the prior distributions so that they are applicable to a variety of zooarchaeological scenarios, the results from the prior assemblage were given larger standard deviations to increase the uncertainty, which allows the model to better fit the data (see Fig. 3 and Model Supplement for more details). The prior distributions act to prevent the model from accepting parameter values that are implausibly large or small given our prior knowledge about size variability and size differences between animal groups.

Fig. 3
figure 3

Posterior distributions of model hyper-parameters from a sample of known-identity sheep specimens (red) and proposed prior distributions for mixture model applications (black). A Average female size (in LSIe scale). B Immature size penalty (average size difference between females and immature animals). C Index of sexual dimorphism (average size difference between males and females). D: Size variability for immature animals. E Size variability for female animals. F Size variability for male animals. Proposed prior distributions provide a useful baseline in the absence of relevant biometric information regarding sexual dimorphism and size variability

Prior predictive checks, shown in the Model Supplement, show the implications of the prior distribution definitions used in the model. The results show that the chosen prior distributions do not exclude the possibilities of extreme values in mixture components and cover a wide range of potential size measurements. On the other hand, these model definitions are not so broad as to include much prior weight on biologically impossible values (i.e., impossibly large measurements) that would slow down how quickly the model runs because it must evaluate the fit of extremely poorly fitting data. Such behavior could also produce biologically implausible results of actual model fits when only a small amount of data are available.

This section describes one approach for defining prior distributions of mixture models that are relevant for a wide range of zooarchaeological cases, particularly if researchers do not have strong preconceptions about the relevant parameters from prior research. It is important to remember that a model’s prior distributions are choices made by the researcher to fit particular research questions, regardless of whether the distributions are informed by advice on reference priors, prior assemblages, or mathematical summaries of existing research. Zooarchaeological assemblages resulting from catastrophic kills would be expected to have a different ontogenetic age and sex composition compared to assemblages derived from sustained hunting or herding take-off (e.g., Lyman, 1987). Similarly, other research contexts may provide an analyst with different prior expectations about animal size variability and overall biometry. Other reference populations, particularly those from other taxa, could also be used to create prior assemblages and help determine the limits on biological feasibility. Researchers could and should adapt their prior distributions to best reflect their intuition about likely parameter values for their research context. Regardless of the prior distributions a researcher uses, it is crucial to formally describe the prior distributions that are used in Bayesian analysis to ensure replicability. Furthermore, researchers should examine the implications of the different candidate prior distributions while developing a Bayesian model to test a research question; prior distributions should be regularly tested even before models are fit to datasets (Gelman et al., 2020).

Extending the Multilevel Analysis to Multiple Sites

The multilevel structure of the model that allows parameters to vary across element portions can also be used to extend the modeling approach to examine multiple assemblages at once. Combining multiple assemblages into a single model allows researchers to investigate regional variation in herd management strategies or outline diachronic trends in body size that may relate to population turnover (e.g., Arbuckle et al., 2016; Arbuckle & Atici, 2013). It also allows researchers to model diachronic changes over the course of a multi-period site’s occupation, as each occupation layer can be defined as a separate assemblage (e.g., Hongo et al., 2009; Wolfhagen et al., 2021). By including the assemblages in the same model, estimates share the same hyper-parameters, which improves the precision of these estimates and allows researchers to directly compare assemblage-specific parameters by using contrasts. Furthermore, adopting this structure provides the foundation for more sophisticated analyses that test specific hypotheses about variations in biometric or compositional parameters, such as spatiotemporal autocorrelation in body size.

An important consequence of extending the model to evaluate multiple sites at once is that the interpretation of the overall hyper-parameters that the researcher inputs into the model changes. Instead of describing the overall estimates for a specific assemblage, these hyper-parameters now describe a “grand mean” of the parameter value for all the included assemblages. These overall summaries could be interpretively useful if, for instance, all the assemblages come from a discrete archaeological culture or region. In other scenarios, however, the interpretation of these overall hyperparameters may be less meaningful than comparisons of assemblage-specific estimates that still account for anatomical variation within each assemblage (see Model Supplement for more details).

Interpreting Model Results: Measured, Modeled, and Full Assemblages

The results of the Bayesian multilevel mixture model include specimen-specific membership probabilities (𝜋Specimen) based on the mixture model parameters. While these membership probabilities can be used to calculate “critical size limits” where the largest membership probability shifts from one group to another (e.g., Monchot & Léchelle, 2002), they can also be used to simulate assemblages of known-group specimens to examine age/sex-stratified estimates of body part representation and sex-stratified fusion rates. Membership probabilities (𝜋Specimen) are used to simulate the specimen’s identity by sampling from the probabilities using a multinomial distribution; in each posterior sample, a single simulated assemblage is created, resulting in a distribution of simulated assemblages with known age/sex assignments (Crema, 2012). The characteristics of these assemblages can then be used to summarize the overall assemblage or identify differences in composition based on element types, fusion states, sub-assemblage features, or other pertinent factors that a researcher is interested in examining in relation to the composition of the assemblage.

The usual goal of a mixture model analysis—like any sex determination analysis—is to estimate the composition of the entire (or modeled) faunal assemblage, rather than just the measured assemblage used by the analyst. Typical analyses elide these differences, smoothly translating the results of an analysis on a measured assemblage (i.e., the sex ratio) to describe the entire assemblage. Sometimes, disparate results from different element portions require explanation, such as different butchery strategies for males and females (e.g., Speth, 1983), but even in these cases, the results from measured specimens are used to describe the entire set of bones from the same element portion. This elision creates a bias by ignoring the existence of unmeasured specimens in the assemblage and presents an interpretive dilemma for researchers, whose only recourse if they are unwilling to make this elision is to discount the model results as unrepresentative.

We can avoid this bias by formalizing the relationship between the measured and modeled assemblages by stating that the measured assemblage is a sample of the modeled assemblage, wherein inclusion is governed by a specimen’s measurability—the preservation of specific bony portions that allow for biometric measurement(s). If we assume that measurability is unrelated to a specimen’s ontogenetic age or sex, then we can assume that the measured assemblage is a random sample of the modeled assemblage. Thus, an unmeasured specimen will have the same model parameters (mixture proportions 𝜋, average size 𝜇, and size variability 𝜎) as the measured specimens from the same element portion. Crucially, this means that we can include unmeasured specimens in our simulated assemblages by using the relevant mixture proportions 𝜋 (adjusted for the specimen’s fusion data as necessary) as that specimen’s membership probabilities (𝜋Specimen). Leveraging the multilevel structure of the model further, we can assume that the overall mixture model hyper-parameters for modeled element portions are equally valid for unmodeled element portions. The Bayesian multilevel mixture model estimates hyper-parameters that describe the average value (𝜇Element) and expected variability (𝜎Element) of mixture model parameters for element portions; these hyper-parameters can be used to estimate the relevant mixture model parameters of an unobserved element portion (Gelman et al., 2020; McElreath, 2020). The resulting parameters, then, could be used to estimate 𝜋Specimen membership probabilities for the unmodeled (and unmeasured) specimens, as in the first extension, creating an estimate of the composition of the full assemblage.

At first blush, these extensions may seem like a departure from the concrete results of a mixture analysis into proxy-upon-proxy esoterica. However, by formalizing the relationship between what data are in the mixture analysis (the measured assemblage) and what data we are interested in describing (the modeled or full assemblage), these extensions are critical for creating a principled interpretation of an assemblage based on the analysis’ results. Mixture analyses are based on the measurable sample of specimens from the modeled subset of all element portions; this does not mean that these results cannot produce useful information, but it does mean that we must contextualize those results by understanding how small the measured assemblage is in comparison with the modeled (or full) assemblage we are interested in describing. These extensions provide a way to do this—measured specimens will have much more certain membership probabilities than unmeasured or unmodeled specimens, owing to the information gained from their size. Thus, including unmeasured and unmodeled specimens will produce less “extreme” results (e.g., a lower probability that a majority of the assemblage is from a single group). This will be especially clear when the measured assemblage is much smaller than the modeled or full assemblage.

Computational Details of the Bayesian Analysis

The Bayesian multilevel mixture model is written in Stan, version 2.30.0 (Stan Modeling Team, 2022). All analyses in this paper use R version 4.1.2 (2021-11-01), in RStudio 2022.12.0.353 (Elsbeth Geranium) (R Core Team, 2022; RStudio Team, 2022); Table 3 lists the packages, versions, and citations for the packages used in the analytical scripts. The model Stan code and analytical R code necessary to replicate and apply the analyses in this paper are freely available on GitHub page and Open Science Framework page. The files include a copy of the Shetland sheep data file from the supplemental files published in Popkin et al., (2012) and archaeological datasets for the case studies downloaded from OpenContext (Buitenhuis, 2013; Carruthers, 2006; Galik, 2013; Gourichon & Helmer, 2013). The analytical code includes two script files—a script for replication and one for application. The R Markdown file (ZooarchMixMod.Rmd) replicates the entire analytical workflow of the paper, with a specific seed set to ensure the exact replicability of the submitted manuscript. Another set of scripts is included for the application of the model to other datasets, standardizing the analytical workflow for faunal datasets structured like the OpenContext faunal datasets used in these case studies (see the GitHub for more details at All scripts (R and Stan) are released under the MIT license, and figures are released as CC-BY to encourage reuse and reproducibility (Marwick, 2017; Marwick & Pilaar Birch, 2018).

Testing the Bayesian Multilevel Mixture Model

Two sets of tests are used to evaluate different aspects of the Bayesian multilevel mixture model. First, the accuracy of the model’s ability to reconstruct the age and sex composition of assemblages is tested using simulated faunal assemblages of known age and sex from the Shetland sheep population. This test evaluates both the single-assemblage model and the multi-assemblage model. Second, two archaeological case studies showcase the applicability of the model to archaeological data and the added insights gained from adopting Bayesian multilevel mixture models. The simulated assemblage case study and the single assemblage archaeological case study use sheep (Ovis aries) measurements, with standard measurements coming from a female wild sheep (Ovis orientalis FMC 57951: Uerpmann & Uerpmann, 1994: Table 12). The multiple assemblage case studies use cattle (Bos taurus) measurements, with standard measurements coming from a wild female aurochs (Bos primigenius “Ullerslev”: Degerbøl, 1970). Two dimensions of the standard cow (Scapula GLP: 89 mm; and Calcaneus GB: 46 mm) were not included in the “zoolog” output and were included manually, drawn from the referenced source.

Simulated Assemblages

A series of simulated assemblages of known age and sex composition are created from the Shetland sheep population by randomly drawing element portions (and all associated measurements) from the total assemblage without replacement. Table 4 describes the measured dimensions included in the simulation analyses from the 10 element portions. The first test, using a single-assemblage model, uses 150 element portions from the Shetland sheep population where every element portion has an equal probability of being selected. There is no guarantee, however, that the element portions have equal representation or even that all element portions are present in the simulated assemblage, which better approximates archaeological assemblages. The result of this first simulation produces an assemblage of 231 measurements from 125 individual animals. Using the same procedure, the second test creates 15 simulated assemblages that are analyzed in a single multi-assemblage model. Demographic observations for phalanx fusion rates and pelvis sex ratios were also simulated from the Shetland sheep population using the same underlying probabilities as the measurement assemblages. Table 5 describes the sample sizes of the measurement assemblages, including any manipulations to the measurement values. The specific elemental composition and measurements of the assemblages, along with the simulated demographic observations, used in both simulations can be recovered from the replication script with the recorded random seed (see also Supplemental Tables S1-S3); using another random seed would provide a conceptual replication of new assemblages drawn from the same underlying populations.

Table 4 Measurements included in the simulation analyses
Table 5 Group composition of the simulated measurement assemblages (element portions)

While the simulated assemblages are derived from the same Shetland sheep population that was sampled to create the ‘prior assemblage’ (see “Developing Prior Distributions from a Prior Assemblage“), there are several key justifications for this double use. First, the prior distributions used in the model differ from the results from the “prior assemblage“ (Fig. 3); prior predictive checks of the single assemblage and multisite models show that the prior distributions are flexible enough to allow a wide range of potential assemblages (see Model Supplement). Logistically (Popkin et al., 2012), the population is the most complete, fully published assemblage of standard measurements, particularly including immature, adult females, and adult males; Davis (1996, 2000) describes similar sheep but does not include any immature specimens. Finally, the simulated assemblages vary in sample size, and some assemblages are manipulated to vary in average body size and expected composition from the original Shetland sheep population. These modifications are important to try to avoid issues of “prior mimicry,” as seen in survivorship modeling (e.g., Millard, 2006); this also stresses, however, the importance of developing additional sources of “prior assemblages” to test or develop relevant prior distributions.

Rather than trying to reconstruct the exact parameter values of the simulated assemblages, parametric accuracy is focused on relating the parameter distributions of the assemblage (the sample) to the respective values in the full Shetland sheep assemblage (the population from which the sample is derived), including any relevant demographic or size modifications. In this sense, the goal is not 100% accuracy; instead, the goal is being well-calibrated, wherein credible intervals about a sample parameter contain the true population values the specified percentage of the time (e.g., 95% of a model’s 95% credible intervals contain the true population values). If too few population values are contained in the interval statements, then the model has overfitted to the sample and the posterior distributions are too narrow; a researcher may falsely distinguish between two assemblages from the same underlying population (i.e., a false positive). If too many population values are contained in the interval statements, then the model has to underfit to the sample, and the posterior distributions are too wide; a researcher may be unable to distinguish between two assemblages that derive from different underlying populations (i.e., a false negative).

Compositional accuracy does not have the same structure as parametric accuracy because there is no underlying population value for composition; there is only the true number of immature, adult female, and adult male specimens in the measured and modeled assemblages. Again, though, it is important to understand accuracy in the context of overfitting and underfitting. Overfitted results, wherein credible intervals about the number of immature, adult female, and adult male specimens contain the true abundances at a lower rate than designed (e.g., fewer than 95% of the 95% credible intervals), could lead to a researcher declaring an imbalance in the demographic composition of an element where one does not exist (or is even imbalanced in the opposite direction). Underfitted results, by contrast, would mean that a researcher is unable to identify an imbalance where one exists because the credible intervals are too wide. It is important, then, to use the simulations to understand the kinds of errors the model is prone to making so that researchers avoid overinterpretation.

Modeled assemblages were created for the single-assemblage simulation and the multi-assemblage simulation by assuming that measured specimens represent 20% of the overall assemblage and sampling more specimens from the Shetland sheep population to create the remaining 80% of the assemblage. For example, in the single-assemblage simulation with 150 measured specimens, this means sampling 600 more specimens from the Shetland sheep population to create a total modeled assemblage of 750 specimens. Specimens could not be repeatedly sampled, though multiple specimens could be from the same individual. As described in the “Interpreting Model Results: Measured, Modeled, and Full Assemblages” section, unmeasured specimens use the relevant 𝜋 parameters for the element portion. For the multi-site simulation, this potentially includes element portions where there are no relevant measurements in the specific assemblage.

Because the “grand mean” parameters in the multisite simulation no longer represent the same thing as in the single assemblage model (see Model Supplement), the prior distributions must also be changed to reflect different expectations. Again, the goal of these prior distribution definitions is to prevent extreme overfitting so that parameter estimates are biologically feasible (Gelman et al., 2008). In general, the centers of the distributions stayed the same, but the uncertainty was increased to reflect the fact that there is less certainty about biologically feasible values for multiple populations, especially if there is size variation expected between the assemblages. These prior distributions are listed in the Model Supplement.

Archaeological Case Studies

The Bayesian multilevel mixture model is applied to two archaeological case studies to showcase its utility for both interpreting a single assemblage and examining multiple assemblages. In both case studies, the sheep and cattle measurements have been previously published on OpenContext, and the general zooarchaeological summaries of the assemblages have been published as well (Buitenhuis, 2008, 2013; Carruthers, 2005, 2006; Galik, 2013; Gerritsen & Özbal, 2019; Gourichon & Helmer, 2008, 2013). Again, LSI𝑒 values are calculated using the same standard animal as the simulation analysis for the single assemblage analysis, the Ovis orientalis female standard animal (FMC 57951) from Uerpmann and Uerpmann (1994, Table 12) and the Bos primigenius female standard animal (Ullerslev: Degerbøl, 1970; Grigson, 1989), operationalized through “zoolog” functions (Pozo et al., 2022). Alongside metric data, the OpenContext faunal tables provide demographic data that can be used to observe relevant estimates of the age and sex composition of the assemblages. The goal of applying the mixture model to these assemblages, then, is to use the metric data to improve estimates of the age and sex composition of the assemblage, biometric estimates, and sex-specific fusion rates.

Single Assemblage: Biometric Analysis of Sheep from 7th Millennium BCE Central Anatolia (Pinarbaşı B)

The site of Pinarbaşı, located in the Konya Plain of central Turkey, consists of a series of rock shelters and open-air sites at the foothills of the Karadağ volcanic region and Lake Hotamiş and its associated wetlands (Baird et al., 2011; Kabukcu, 2017). This case study examines the Pinarbaşı B Late Neolithic occupation, which is dated to the second half of the 7th millennium BCE and includes a large number of domesticated sheep and goat remains (Baird et al., 2011; Carruthers, 2005). Carruthers (2005) analyzed fauna from the 1994 to 1995 excavations by Trevor Watkins (Watkins, 1996), interpreting the presence of fetal sheep remains and other juvenile remains in the assemblage as evidence for herders penning sheep on-site. The Neolithic assemblage was, thus, described as the result of seasonal occupation by sheep and goat herders during the lambing season and the fall, with culling in the spring possibly focused on young males (Carruthers, 2005). This analysis makes several claims that can be evaluated with the Bayesian multilevel mixture model: the dominance of immature remains and a female-dominated adult sex ratio.

The Bayesian multilevel mixture model for the Late Neolithic Pinarbaşı B assemblage uses 44 sheep measurements from 44 specimens (see Table 6; Supplemental Table S4). In addition to these measurements, the observed proportion of immature animals from unfused first and second phalanges is 59/62 (95%), including specimens identified as sheep and sheep/goat. There are 0 observed sheep (or sheep/goat) pelvis bones with sex identifications; this is entered into the model by having an observed adult sex ratio for the assemblage of 0/0 (females/females + males). All data come from the Pinarbaşı faunal assemblage uploaded to OpenContext, focusing only on specimens in the Site B Neolithic contexts (Carruthers, 2006). The Pinarbaşı B sheep model uses the same prior distribution definitions for the model hyper-parameters as the single assemblage simulation since both models, even though the sheep body sizes likely differ between the two populations, showcase the flexibility of the standard prior distribution definitions.

Table 6 Elemental composition of the Pinarbaşı B assemblage

Multiple Assemblages: Biometric Analysis of Cattle from 7th to 6th Millennium BCE Northwest Anatolia (Barcın Höyük, Ilıpınar Höyük, Menteşe Höyük)

Understanding the development of Neolithic communities in northwestern Anatolia has long been of interest to researchers interested in studying the spread of agricultural lifeways from southwest Asia into Europe (e.g., Çakırlar, 2013; Karul, 2019; Özdoğan, 2011, 2019). Agricultural communities first appear in the Marmara region in the mid-seventh millennium BCE in sites like Barcın Höyük (Gerritsen & Özbal, 2019; Karul, 2019). The domestic animal economies of these Late Neolithic and Early Chalcolithic communities appear to be focused on cattle and caprine (sheep and goat) herding, rather than pig husbandry (Buitenhuis, 2008; Çakırlar, 2013; Gourichon & Helmer, 2008). Milk residues on pottery recovered from these sites suggest these communities regularly consumed milk, potentially orienting herd management strategies of sheep, goats, and particularly cattle to specialize in milk production (Evershed et al., 2008; Thissen et al., 2010).

Four archaeological components from three sites are used in this case study, located near Lake İznik and on the Yenişehir Plain in the Bursa province of Turkey (Fig. 4). The Neolithic layers from Barcın Höyük (Phase VI) are the earliest of these assemblages, with occupation roughly from 6500 to 6000 cal BCE; excavations revealed a subsistence economy focused on cereal agriculture and the herding of cattle, sheep, and goat (Galik, 2013; Gerritsen & Özbal, 2019). Menteşe Höyük is located approximately 5 km west of Barcın Höyük on the Yenişehir Plain; the three Neolithic layers at the site date from 5800 to 5600 cal BCE (Gourichon & Helmer, 2013; Roodenberg et al., 2003). Previous faunal analysis of the Neolithic assemblage identified animal economies that shifted from predominantly cattle to sheep herding over the course of the occupation (Gourichon & Helmer, 2008). Ilıpınar Höyük is located near Lake İznik, separated from the Yenişehir Plain by a mountain ridge (Roodenberg, 2012a). The Neolithic/Early Chalcolithic occupation of the site spanned 6200–5400 cal BCE (Buitenhuis, 2013); the assemblage is divided into two sub-assemblages (Neolithic Ilıpınar = Phases X-VII, 6000–5700 cal BCE; Chalcolithic Ilıpınar = Phases VI-V, 5600–5400 cal BCE), marked by the introduction of mudbrick architecture and expanded storage (Roodenberg, 2012a, 2012b). Sheep and goat are common in the earlier assemblages of the site, with cattle becoming predominate in later phases of the site (Buitenhuis, 2008; Roodenberg, 2012a). Notably for this biometric analysis, Buitenhuis (2008) notes that cattle body sizes are stable throughout the site’s occupation.

Fig. 4
figure 4

Map of archaeological sites included in this analysis

The northwest Anatolian cattle bone assemblages consist of 614 measured specimens spread unevenly across the four components (Barcin Höyük N = 67, Menteşe Höyük N = 45, Neolithic Ilıpnar N = 249, Chalcolithic Ilıpnar N = 253; Supplemental Table S5). All measured Bos remains were included in the analysis, rather than separating out those identified as aurochs (Bos primigenius, N = 3) or identified only as Bos spp. (N = 134) in the Ilıpınar Höyük dataset; all specimens were only labeled as “Bos” in the Menteşe Höyük dataset. Table 7 shows the composition of the four northwest Anatolian measurement assemblages. Demographic observations of the proportion of immature animals and the adult sex ratio for each assemblage describe these parameters. For the four northwest Anatolian assemblages, estimates of the assemblage-level proportion of immature specimens based on the fusion rates of proximal and middle phalanges for cattle specimens are 28/87 (32%) for Barcın Höyük, 28/184 (15%) for Neolithic Ilıpınar, 8/25 (32%) for Menteşe Höyük, and 9/89 (10%) for Chalcolithic Ilıpınar. The observed adult sex ratios (females/females + males) based on cattle pelvis morphology are 3/4 (75%) for Barcın Höyük, 0/0 for Neolithic Ilıpınar, 0/0 for Menteşe Höyük, and 3/5 (60%) for Chalcolithic Ilıpınar. As in the Pinarbaşı B example, observations of 0/0 impart no information onto the prior distribution of the adult sex ratio. All demographic and measurement data come from the OpenContext datasets (Buitenhuis, 2013; Galik, 2013; Gourichon & Helmer, 2013); the associated RMarkdown file (“ZooarchMixMod.Rmd”) in the GitHub project includes the steps for data processing and analysis.

Table 7 Elemental composition of the northwest Anatolia cattle measurement assemblages

Previous syntheses of the Late Neolithic and Early Chalcolithic animal economies in northwest Anatolia provide several prior inferences about the age and sex structure of cattle bone assemblages that can be evaluated with the results of the Bayesian multilevel mixture model. First, the general cultural continuity of the assemblages suggests that the biometry and composition of cattle bone assemblages may be similar at the sites, having been produced by similar processes (Çakırlar, 2013; Özdoğan, 2019); Buitenhuis (2008, 312) explicitly states that there is no size change among cattle bones across the Ilıpınar assemblage. Second, the widespread evidence of milk consumption from pottery residue analyses from these sites and others in the region (Evershed et al., 2008; Thissen et al., 2010) has led some researchers to argue that cattle were managed for milk production (Gourichon & Helmer, 2008; Roodenberg, 2012a). Gourichon and Helmer (2008, 440) argue that the cattle tooth eruption and wear data at Menteşe indicate exploitation focused on milk consumption; one consequence of this pattern should be female-dominated adult sex ratios, including higher fusion rates for later-fusing elements among females than males (Zeder & Hesse, 2000). The multilevel modeling results can be used to evaluate the feasibility of these inferences by examining posterior distributions of relevant parameters and simulations of sex-specific fusion rates.

Because this application is a multisite model and deals with a different taxon than the original simulations, the prior distributions for the model hyper-parameters are redefined to reflect different expectations of biological feasibility. While the multisite simulation provides useful prior distribution definitions for most of the parameters, two other parameters (average body size of females 𝜇2 and index of sexual dimorphism log(𝛿2)) should be changed because of different expectations modeling cattle rather than sheep. The change in the prior distribution definition of 𝜇2 reflects the fact that the standard measurements for cattle come from an aurochs female (Degerbøl, 1970), which is expected to be larger than the domestic cattle females in the assemblages. Cattle are expected to be more sexually dimorphic than sheep, which is reflected in increasing the average expected value of log(𝛿2), resulting in an expectation of 0.14 LSI𝑒 units between males and females on average. This is slightly lower than the index of sexual dimorphism seen in the Degerbøl (1970) aurochs specimens (Grigson, 1989: Fig. 2, which uses LSI10; the equivalent size difference is 0.06 on the LSI10 scale), though domestic cattle may be expected to be less sexually dimorphic than their wild counterparts (e.g., Tchernov & Horwitz, 1991); these prior distribution definitions are listed in the Model Supplement.


Bayesian models that use the Monte Carlo methods, like the ones used here, rely on convergence diagnostics to ensure that the results (posterior distributions) have converged to the target distribution—that the results are not unduly affected by the random starting position of the analysis. To do this, analysts run multiple independent chains of the model—starting from different initial values—then evaluate how similar the chains are to one another using different diagnostic criteria (e.g., Gelman & Rubin, 1992). Supplemental Tables S6-S11 show the posterior estimates of the (overall and site-specific) model hyper-parameters and diagnostic criteria (R-hat and effective sample size). These results are consistent with the model successfully converging, as R-hat values are ≤ 1.01 and effective sample sizes are greater than ×100 the number of chains (Vehtari et al., 2021). Trace plots show the value of a parameter at each posterior sample, with each chain overlain on top of each other; a converged model should have no directionality across the length of the chain and the independent chains should be indistinguishable from one another. Trace plots of each model’s overall hyper-parameters are shown in Supplemental Figures S1-S4. None of the parameters show extreme deviations between chains, supporting the assertion that the model’s posterior distributions properly describe the data and prior beliefs.

Simulated Assemblages: Testing Model Accuracy

Bayesian models work by updating prior information with new data to produce posterior distributions of parameters of interest (Otárola-Castillo et al., 2022). Thus, the difference between a model parameter’s prior and posterior distribution shows the amount that the model “learns” from the data. If the data do not provide relevant information on a parameter’s potential values, then the posterior distribution will resemble the prior distribution. Figure 5 compares the prior and posterior distributions of the main model hyper-parameters for the single assemblage simulation (for prior-posterior comparisons of the other models, see Supplemental Figures S5-S7). The results show that the data provides much more information about the likely values of the two demographic parameters (the proportion of immature animals, 𝜋1, and the adult sex ratio, \(\frac{\pi_2}{\pi_2+{\pi}_3}\)) and the average 594 body size for females (𝜇2). This is largely to be expected, as the prior distribution definitions were weakly informative priors (Gelman et al., 2008), but it also shows how these choices did not appear to severely influence the resulting posterior distributions.

Fig. 5
figure 5

Comparison of prior and posterior distributions for mixture model hyper-parameters of the simulated single assemblage. The model hyper-parameters serve as assemblage-wide estimates accounting for size and composition variation across element portions.

The prior distribution definitions for the size offsets (𝛿1 and 𝛿2) and the size variability estimates (𝜎1, 𝜎2, and 𝜎3) have a lot more overlap between the prior distributions and their respective posterior distributions. This overlap stresses the importance of using a Bayesian framework, particularly one relying on informative prior distributions, to produce meaningful parameter estimates from zooarchaeological data. But it also highlights the interpretive weight given to the reference population. However, the overlap is not necessarily a drawback of the model, as again, the prior distribution definitions were designed as informative priors, specifically to ensure that the resulting parameter estimates would be biologically feasible. Furthermore, the simulated population also has the same underlying biological population (the Shetland sheep population) that was used to develop the prior distributions, so it is possible that this overlap reflects that fact.

The parametric and compositional accuracy of both simulation tests are summarized in Table 8. The single assemblage model is well-conditioned when examining parametric accuracy, though the multisite model overfits in this respect; this is driven by poor performance on size variability (𝜎) parameters—the model estimates average body size (𝜇) parameters well. The multisite model also has a tendency to underfit when examining site-specific compositional accuracy. In both models, though, the compositional accuracy improves (in the sense of no longer underfitting) by using the modeled assemblages rather than the measured assemblages. This makes intuitive sense, as the measured assemblage is itself theoretically a sample from the modeled assemblage (based on the assumption that “measurability” is random). In these simulations, of course, this theory is held to be explicitly true, though the relationship between the measured and modeled assemblage is generally held to be true implicitly in zooarchaeology and can be explicitly tested (see “Interpreting Model Results: Measured, Modeled, and Full Assemblages”).

Table 8 Parametric accuracy rates and expected accuracy ranges for the simulation models compared to the population values from which the samples were drawn

Figures 6 and 7 show the posterior distributions of the site-level parameters for the single-assemblage and multisite simulations. Figures 6A and 7A show the posterior distributions for the mixing proportions (𝜋 parameters), Figures 6B and 7B show the posterior distributions for the average body size (𝜇 parameters), and Figures 6C and 7C show the posterior distributions for the size variability (𝜎 parameters) for immature (black), adult female (blue), and adult male (red) specimens in the assemblage. Vertical bars summarize the 80 and 95% credible intervals for the parameter, while the solid horizontal lines denote the “true” parameter value for the Shetland sheep population from which the assemblage was sampled, including any deviations made for the multisite simulation (see Table 5). The model excels at estimating the average body sizes of the animal groups, even when those body sizes are manipulated (Figs. 6 and 7B). The multisite model’s estimates of the overall proportions tend to be conservative (Fig. 7A), that is, underestimating the proportions for animal groups with extremely low “true” proportions (e.g., adult males in sites 6–10, immature animals in sites 11–15) while overestimating the proportions for animal groups with extremely high “true” proportions (e.g., adult females in sites 6–10). This is likely due to the multilevel modeling structure “shrinking” estimates towards a global mean, though it may also reflect overfitting from small sample sizes. This is particularly the case when examining variability parameters (𝜎: Fig. 7C), which the multilevel model tends to underestimate.

Fig. 6
figure 6

Posterior distributions of the mixture model hyper-parameters (assemblage-level estimates) for the simulated single assemblage. A Mixture proportion (π). B Average LSI e size (μ). C Size variability (σ). Horizontal lines denote the relevant parameter values from the Shetland sheep population from which the assemblage was sampled

Fig. 7
figure 7

Posterior distributions of assemblage-level mixture model parameters for the simulated multisite assemblages. A Mixture proportion (π). B Average LSIe size (μ). C Size variability (σ). Horizontal lines denote the relevant parameter values from the Shetland sheep population from which the assemblage was sampled, including modifications to the composition and body size for some assemblages (see Table 4)

Figure 8 shows posterior distributions of simulated group-specific compositions for both the single-assemblage and total composition of the multisite models (i.e., all sites combined) alongside true counts for each group. The underfitting performance of the multisite simulation, particularly at the level of individual sites, may be a consequence of low statistical power due to small sample sizes. The median number of specimens per element portion is 4 for the measured assemblage and 17 for the modeled assemblage. By contrast, the single-assemblage simulation has a median number of specimens per element portion of 19.5 for the measured assemblage and 92 for the modeled assemblage. More important than overall sample sizes, though, is the group-specific element counts, where the contrast between the median number of specimens in the measured and modeled assemblages from the multisite simulation (measurement assemblage: 1, modeled assemblage: 4) and the measured and modeled assemblages from the single-assemblage simulation (measurement assemblage: 5, modeled assemblage: 26) is starker. The larger sample sizes in the modeled assemblages also partially explain the reduced underfitting relative to the measured assemblages.

Fig. 8
figure 8

Posterior distributions of simulated group-specific composition for the simulated assemblages. Top row: single assemblage model A measured assemblage and B modeled assemblage. Bottom row: multisite assemblage model (combined counts) C measured assemblages and D modeled assemblages. Horizontal lines denote the true group-specific composition of the relevant measured or modeled assemblage

It is also noteworthy that the additional sampling to create the multisite simulation’s modeled assemblage resulted in seven newly observed element portions at certain sites. The model can estimate the composition of these element portions due to its multilevel structure, which estimates element-specific offsets and interaction terms (𝜈Element and 𝜈Interaction) for elements that are present in at least one site. The compositional accuracy of these newly observed element portions is well-calibrated (see Table 8), despite having no observed measured specimens from the element portion for those sites. That these element portions can be accurately modeled despite lacking observed measurements for the site lends support to the idea that researchers could extend the same multilevel model structure (element-specific offsets and interaction terms) to estimate the composition of unobserved (i.e., completely unmeasured) element portions in an assemblage.

Pinarbaşı B Sheep: The Impact of Immature Specimens

The first three distributions in Fig. 9 show the posterior estimates of the assemblage-level proportion of immature, female, and male sheep at Pinarbaşı B. In general, the Pinarbaşı B sheep assemblage is overwhelmingly composed of immature animals (posterior \({\mu}_{\pi_1}\)median = 87%; 95% posterior credible interval for \({\mu}_{\pi_1}\) = 77–94%), somewhat lower than the observed fusion rate of proximal and middle phalanges (59/62 = 95%). Even though the overall proportions of female and male animals in the assemblage are low relative to immature specimens, the Bayesian multilevel mixture model can produce a posterior distribution of the overall adult sex ratio that suggests that adult females are more common than adult males (median 𝜃Female = 64%; 95% posterior credible interval for 𝜃Female = 10–96%); the wide credible interval of this distribution reflects the small proportions of mature specimens in the assemblage overall. Despite this uncertainty, we can use this distribution to estimate the probability that adult females are more common than adult males (i.e., that 𝜃Female is greater than 0.5); 68% of the posterior 𝜃Female samples are above 0.5, providing some confidence in the interpretation that the adult sex ratio is skewed towards females. Note that any threshold value could be chosen to test a hypothesis about the adult sex ratio—for instance, only 45% of the posterior 𝜃Female samples are above 0.67 (i.e., a 2:1 female-to-male ratio), providing poor support that the mature portion of the assemblage is heavily skewed towards females.

Fig. 9
figure 9

Posterior distributions of composition estimates (π) for the Pinarbaşı B sheep assemblage. The first three distributions to the left show assemblage-level estimates, while the other distributions show element-specific composition estimates

The other distributions show element-specific composition estimates for the Pinarbaşı B assemblage. Most of the element-specific proportions of immature animals (𝜋1) broadly match the overall estimates, with posterior medians over 75%. However, these element-specific distributions also have long tails extending into lower 𝜋1 values, conveying less certainty about element-specific 𝜋1 estimates relative to the assemblage-wide estimate. This likely owes to small element-specific sample sizes (the astragalus, calcaneus, and proximal phalanx have 9–10 specimens, all other element-specific samples sizes are 1–4, see Table 6) and to the presence of some element portions with lower modeled proportions of immature animals. Two element portions—the distal metacarpal and distal metatarsal–have posterior 𝜋1 median values below 75%, though again, have long tails that extend in both directions (Distal metacarpal: 𝜋1 posterior median = 66%, 95% posterior credible interval: 46–92%; distal metatarsal: 𝜋1 posterior median = 49%, 95% posterior credible interval: 32–86%). Notably, all measured specimens from these two element portions have fused distal epiphyses, meaning that the model considers it impossible for the specimens to be immature.

Figure 10 shows the same comparison (overall and element-specific distributions) for the average size of female animals (𝜇2) in the Pinarbaşı B assemblage. Average body sizes vary across elements, highlighting some allometric variation between Pinarbaşı B sheep and the standard sheep and the importance of using a multilevel model to account for allometric variation. The multilevel structure of the model provides a parameter (𝜎Element[3]) that directly estimates this variation: the posterior mean of this distribution is 0.05, with an upper 95% quantile of 0.08 on the LSI𝑒 scale. Combining all the measurements into a single LSI analysis would confound this element-level variation with differences in composition, muddling the ability to compare the biometry of Pinarbaşı B sheep with sheep from contemporary assemblages.

Fig. 10
figure 10

Posterior distributions of average LSI value for female animals (μ2) for the Pinarbaşı B sheep assemblage. The first three distributions to the left show assemblage-level estimates, while the other distributions show element-specific size estimates

Figure 11 shows the distribution of simulated compositions for immature, female, and male specimens in three Pinarbaşı B assemblages: the measured assemblage (N = 44), the assemblage of modeled element portions (N = 277), and the full sheep assemblage including five element portions that were not modeled due to lack of measurements (additional elements: proximal radius, ulna, proximal metacarpal, pelvis, and middle phalanx; total N = 428). Beyond visualizing the group-specific composition of the assemblage, the simulated compositions provide more insight into the assemblage’s formation. Five element portions in the full assemblage (astragalus, calcaneus, proximal radius, proximal fused metacarpal bones 3 and 4, pelvis) contain no mature specimens in 18% of the simulations; by contrast, no element portion has more than 2% of the simulations without immature specimens (see Table 9). Additionally, element portions vary in the probability that they contain male specimens: distal metapodials (no males in 8% of simulations) and phalanges (no males in 16% of simulations) contain males much more frequently than other element portions (no males in 39% of simulations; Table 9). While this analysis focuses on appendicular elements, these results suggest that mature animals—particularly mature males—did not enter the assemblage as complete animals but potentially as raw materials for bone tools (e.g., distal metapodials, distal tibia: Russell & Griffitts, 2013) or on skins (e.g., proximal and middle phalanges). Immature animals, by contrast, appear to have entered the assemblage as complete animals; the element portions with the highest probability of missing immature specimens may be affected more by density-mediated attrition (e.g., proximal tibia: Symmons, 2005) and identifiability of neonatal specimens (e.g., distal metapodials; see “Discussion”).

Fig. 11
figure 11

Posterior distributions of simulated group-specific composition for the Pinarbaşı B sheep (top) measured, (middle) modeled, and (bottom) full assemblages. The full assemblage includes five additional element portions compared to the measured and modeled assemblages (proximal metacarpal, pelvis, middle phalanx, proximal radius, ulna)

Table 9 Percentage of simulated group-specific compositions of the full Pinarbaşı B sheep assemblage that exclude a different population category by element portion

These results strongly reinforce the interpretation that Pinarbaşı B was used by herders as a camp where sheep gave birth, with on-site culling largely reflecting either the first seasonal cull of animals before winter or animals that died naturally in their first year of life (Carruthers, 2005; Martín et al., 2015). This method reinforces previous analyses, which is not surprising; however, the mixture modeling results go beyond these earlier interpretations by creating group-specific biometric estimates and providing a probabilistic framework to estimate a specimen’s membership into the three groups. The biometric estimates not only account for the presence of immature specimens in the assemblage but also for allometric variation across element portions. The probabilistic identifications allow analysts to simulate assemblage compositions, highlighting potential differences in the ways that bones from immature, adult female, and adult male entered the Pinarbaşı B assemblage.

7th–6th Millennium BCE Northwest Anatolian Cattle: Examining Differences Between Assemblages

Figure 12 shows the posterior distributions of average body sizes for female cattle (𝜇2) from the four analyzed assemblages. These distributions are produced from posterior samples; assemblage-specific estimates from a single posterior sample share the same relevant hyper-parameters (\({\mu}_{\mu_2}\)and 𝜎Site [3]), meaning that they covary with one another to an extent. To compare these distributions, then, a contrast is necessary to account for this potential covariation. This is done by simply evaluating the difference between two parameters (e.g., between the average female LSI𝑒 value 𝜇2 for Barcın Höyük and 𝜇2 for Neolithic Ilıpınar) in each posterior sample, shown in the right-hand panel of Fig. 12. These contrasts show that the female cattle from Chalcolithic Ilıpınar are likely smaller, on average, than female cattle from the other sites. These cattle measurements are 3–7% smaller, on average, than those from the other northwestern Anatolian sites relative to the standard animal measurements. Thus, the mixture modeling approach not only identifies a size difference that was unrecognized when using standard LSI analysis (e.g., Buitenhuis, 2008), but also reveals a diachronic trend in cattle body size that can be studied to evaluate factors like climate or herding practices that affected animal body size over time (e.g., Munro et al., 2022; Wright & Viner-Daniels, 2015). Accounting for differences in the elemental and demographic composition of different assemblages provides more accurate reconstructions of body size that allow us to critically examine patterns of animal size change in the past (e.g., Arbuckle & Kassebaum, 2021; Manning et al., 2015).

Fig. 12
figure 12

Posterior distributions of site-specific average LSI value for female animals (μ2) for the northwest Anatolian cattle assemblages. A The distributions in broadly temporal order from left to right; B site-specific contrasts for average female body size, indicating specific size differences between pairs of sites. The title describes the focal assemblage, while the color scheme for the distributions is consistent across both panels. If the contrast distribution is greater than zero, then the focal assemblage is larger than the displayed assemblage (e.g., the top-most distribution in the top panel shows that Barcın Höyük is significantly larger than Chalcolithic Ilıpınar Höyük)

In addition to identifying size differences among the assemblages, the modeling results also reveals variation in the age and sex composition of the four assemblages. Figure 13 shows the distributions of assemblage-level demographic variables—the proportion of immature animals and the adult sex ratio (the proportion of adults that are female)—for the four northwest Anatolian assemblages. The assemblages have broadly similar estimates for the adult sex ratio (right-hand panel of Fig. 13), with strong evidence that they contain more females than males. However, modeling shows significant variation in the proportion of immature animals in the assemblages (top panel of Fig. 13): the proportion of immature cattle in the Menteşe Höyük and Barcın Höyük assemblages is significantly higher than the proportions of immature cattle in either Ilıpınar Höyük assemblage. This difference between the sites could be investigated further to understand whether there are taphonomic factors that have selectively removed immature specimens from the Ilıpınar Höyük assemblage post-depositionally or whether herd management strategies differed among communities at these sites (e.g., Gillis et al., 2014, 2015).

Fig. 13
figure 13

Comparison of posterior distributions of site-specific demographic parameters (proportion of immature πImmature and adult sex ratio θFemale) for the northwest Anatolian cattle assemblages. The top panel shows the marginal plot for the proportion of immature specimens, while the right-hand panel shows the marginal plot for the adult sex ratio (in the proportion of females among mature animals)

Simulating sex-specific fusion rates for late-fusing elements (proximal femur, distal femur, proximal humerus, distal radius, proximal tibia, proximal ulna: Grigson, 1982) from the full northwest Anatolian assemblages highlights the complexities of examining sex-specific fusion rates in zooarchaeological assemblages. In each assemblage, estimates of male fusion rates are extremely uncertain, owing to the small number of estimated males in each iteration and thus large potential shifts in the denominator for fusion rates (Fig. 14). This uncertainty makes it difficult to clearly establish whether fusion rates differed between males and females; regardless, in 68% of the posterior samples female fusion rates were higher than male fusion rates for Chalcolithic Ilıpınar. These results complicate regional syntheses that tie the presence of milk residues to milk-oriented cattle management (e.g., Evershed et al., 2008); the ability to directly estimate sex-specific fusion rates allows researchers to test the validity of these exploitation models for past assemblages (e.g., Arbuckle & Atici, 2013).

Fig. 14
figure 14

Posterior distributions of simulated sex-specific fusion rates for late-fusing elements among northwest Anatolian cattle full assemblages. Included element portions are the distal tibia, distal metapodials, calcaneus, femur, proximal ulna, distal radius, proximal tibia, and proximal humerus


The simulation analyses show that the Bayesian multilevel mixture model presented here can accurately reconstruct age- and sex-specific biometry of a faunal population represented in a measured assemblage, while also producing relatively accurate estimates of the “demographic” (ontogenetic age and sex) composition of the assemblage. The archaeological applications of the mixture model demonstrate how the model can highlight meaningful variation in the composition and relative size of specimens across element portions. The results can point to potential differences in how animals entered an assemblage, as between immature and mature sheep at Pinarbaşı B, or reveal variation in demographic proportions that could highlight taphonomic differences or variation in management strategies across the sites, as in the northwest Anatolian cattle. While broadly supporting the earlier analyses of these assemblages, applying mixture models to the measurement assemblages also opened new lines of inquiry based on exploring the drivers of variation in body size and assemblage composition.

The performance of the Bayesian multilevel mixture models relies on the prior distributions, which provide constraints against overfitting and ensure that the model produces biologically reasonable parameter estimates. The prior distribution definitions in this paper were derived largely from the measurements of a herd of known-age, known-sex population of Shetland sheep (Popkin et al., 2012), though for the multisite cattle model, some of the definitions were changed based on data on European aurochsen (Degerbøl, 1970). It is important to note that prior distribution definitions can be derived from many different sources—including quantification based on one’s judgment (e.g., Gelman et al., 2008; McCarthy & Masters, 2005). This could allow researchers to use different kinds of demographic priors to adjust for their expectations about the ways that assemblages are created, for instance, if one may expect catastrophic profiles (e.g., Stiner, 1990). More important than the source of one’s prior distribution definitions is investigating the expectations of those prior distribution definitions by performing prior predictive checks as in the Model Supplement (Gabry et al., 2019; Gelman, Vehtari, et al., 2020). Furthermore, emphasis should be paid to increasing the diversity of known-age, known-sex animal populations with individual measurement data (e.g., Lebenzon & Munro, 2022; Zeder & Lemoine, 2020), which could help develop prior distributions relevant to different taxa and to understand how variable different parameters, especially size variability (𝜎) parameters, are across populations.

One of the central tenets of the mixture model’s extension to modeled assemblages is the idea that “measurability” (adequate preservation to maintain a biometric dimension) is unrelated to a specimen’s status as immature, female, or male. Variation in the mixture proportions among elements, especially the proportion of immature specimens 𝜋1 may highlight group-specific biases in the deposition of specimens but could also indicate issues with the assumption that “measurability” is random. The Pinarbaşı B sheep assemblage potentially demonstrates this issue, as the distal metapodials have much lower element-specific 𝜋1 estimates than other element portions. While metapodial bones from mature sheep could have been selectively over-represented in the assemblage, it is also likely that distal metapodials from immature animals—particularly very young animals—are less likely to be measurable compared to adult animals. Because the distal breadth measurement requires both distal condyles to be present, distal metapodial specimens from neonatal or extremely juvenile individuals may be missed while those from other element portions (e.g., proximal metapodial, distal humerus) would still be theoretically measurable (Martín & García-González, 2015). The inclusion of condyle-specific measurements could address this issue, though it would require identifying whether the isolated condyle is medial or lateral (e.g., width of condyle: Payne, 1969).

The ability to create accurate simulated estimates of age and sex composition provides many opportunities for further analyses. For instance, a comparison of the composition of animals in different depositional contexts could support contextual taphonomic analyses (e.g., Meier, 2020). Access to certain kinds of animals could highlight systems of provisioning or status-related restrictions (Arbuckle 2012; Twiss, 2019: 73–97). Differences in the ontogenetic age and sex composition of different body parts could also highlight ritual behaviors reflected in the use of certain contexts or sites (e.g., Madgwick & Mulville, 2015). In a similar vein, tracking adult sex ratios could identify the use of castrates to take advantage of secondary products like wool or labor; the models used here would identify castrates as males, though alternative prior distributions and measured dimensions could distinguish these groups (e.g., Popkin et al., 2012: Fig. 7). On a more practical level, providing specimen-specific probabilities of being immature, female, or male can provide a useful baseline for sampling strategies focused on ancient DNA or stable isotopes, allowing researchers to explore potential sex differences in diets (e.g., Post et al., 2001) or more easily identify male specimens to isolate Y-chromosomal DNA to explore sex-specific selection (e.g., Daly et al., 2021; McGrory et al., 2012).

The archaeological case studies highlight the importance of considering the presence of immature specimens and elemental variation in body size when summarizing the biometry and composition of an assemblage. Variation in the proportion of immature animals in the assemblage, as in the multisite case study for northwest Anatolia, may point to differences in culling strategies or even the seasonality of animal presence at the sites. Most of the Pinarbaşı B material derives from immature specimens, which could complicate inter-assemblage biometric analyses that do not use sex-specific size estimates (e.g., Arbuckle et al., 2014; Helmer et al., 2005). Restricting measurements only to fused specimens removes useful information, particularly when fusion rates may differ between male and female animals (Zeder & Hesse, 2000); furthermore, it does not resolve the problem of immature animals in the measurement assemblage if early-fusing elements like the distal humerus are still included because of post-fusion growth (Popkin et al., 2012). The ability to create sex-specific biometric estimates is important to document large-scale spatial and temporal dynamics in animal body size (e.g., Arbuckle et al., 2016; Arbuckle & Kassebaum, 2021; Wolfhagen et al., 2021).

Examining LSI𝑒 (or LSI10) values of Pinarbaşı B sheep without accounting for the impact of immature animals would mislead a researcher into believing that sheep were smaller, on average than sheep from contemporaneous sites (e.g., late Çatalhöyük: Baird et al., 2011). While stable isotopic analyses have suggested that the Pinarbaşı B and Çatalhöyük sheep had similar diets (e.g., Baird et al., 2011, 2018; Middleton, 2018), biometric analyses had not previously supported the idea that these assemblages derived from the same animal population (e.g., Arbuckle et al., 2014: Fig. 4). Only eight measurements from the Pinarbaşı B sheep are excluded from being immature due to fusion status, four of which are distal metatarsals; the median LSI𝑒 of these specimens is −0.12, compared to a median of −0.06 for the 362 Late Çatalhöyük sheep that must be mature based on fusion status (Wolfhagen et al., 2021). Even though the mixture model’s estimate for the average LSI𝑒 for female sheep (𝜇2) at Pinarbaşı B is relatively uncertain (95% credible interval: −0.14 – −0.05 LSI𝑒), it still provides a useful reference point for comparison that supports the idea that these assemblages have similar biometrics (95% credible interval of 𝜇2 for late Çatalhöyük sheep: −0.11 – −0.06 LSI𝑒: Wolfhagen et al., 2021).

The increased ability to specify the ontogenetic age and sex composition of faunal assemblages with Bayesian multi-level mixture models also highlights the limitations of our current language used to describe and interpret these compositions. Results like the sex-specific fusion biases in northwest Anatolian cattle complicate straightforward expectations of clear sex-specific patterns in archaeological assemblages, which is perhaps unsurprising. Many discussions that examine changes in the composition of faunal assemblages to identify shifts in exploitation patterns use terms like “prime-dominated age structure” (e.g., Stiner, 1990), “dominance of females,” (e.g., Peters et al., 2005 or “young male slaughter/kill-off” (e.g., Arbuckle & Atici, 2013; Zeder & Hesse, 2000). These terms are deceptive in their utility—they describe some empirical pattern but are ordinal-scale; thus, it is up to the individual researcher to define the cut-off between a “dominant” and “non-dominant” assemblage (Wolverton et al., 2016). In the case of the adult sex ratio for Barcın Höyük, 98% of the posterior samples are above 60% (1.5 females:1 male), but only 68% of the posterior samples are over 75% (3 females:1 male). Meanwhile, 79% of the posterior samples for the adult sex ratio for Neolithic Ilıpınar are over 80% (4 females:1 male). Are both assemblages “dominated by females”? More formalized language in our hypotheses—or, rather, the adoption of statistical modeling frameworks (McElreath, 2020: 4–17)—is necessary to clarify what changes in assemblage-level estimates of biometry and composition mean for past human-animal interactions.


This paper describes a new method for estimating the biometry and ontogenetic age and sex composition of faunal assemblages based on standard measurement data using Bayesian multilevel mixture modeling. The model produces accurate estimates of sex-specific biometry, which can provide a more useful framework for inter-assemblage analysis (e.g., Arbuckle et al., 2016; Helmer et al., 2005). Such a framework could better explore broad spatial and chronological patterns in animal biometry while accounting for differences in assemblage composition across the assemblages, ensuring reliable comparisons of animal body size in relation to other variables. These analyses could investigate the processes behind size fluctuation in animals, particularly in relation to changing human-animal interactions and adaptation to new lifeways and anthropogenic environments.

Furthermore, the estimates of the age and sex composition of the assemblage can be used to simulate assemblages of specimens with known group assignment (immature, female, and male). These simulations are the baseline for comparing differences in the composition of sub-assemblages. Using these simulations allows researchers to make testable statements about the composition of the assemblage and to directly test hypotheses about differences in the age and sex composition of animal bones from different parts of a site, different fusion groups, or other categories. The Bayesian structure of the model allows researchers the flexibility to create hypotheses that can be tested directly, rather than relying on null hypothesis testing for inference (Otárola-Castillo et al., 2022; Otárola-Castillo & Torquato, 2018). Thus, the mixture modeling framework described here provides a foundation for biometric and compositional analyses that operate at multiple scales and presents a new avenue for summarizing and comparing zooarchaeological assemblages.