Introduction

The estimation of skeletal sex requires an understanding of the differences in skeletal architecture relative to distinct sex-specific biological processes and evolutionary theory, including musculoskeletal loading [1], pubertal growth trajectories, and sexual selection [2, 3]. In modern forensic practice, the skull is considered the second-most popular region for skeletal sex estimation [4]. Amongst the most frequently applied morphoscopic standards is the Walker [5] standard, which is based on five cranial traits. Although originally published in Standards for Data Collection from Human Skeletal Remains [6], its subsequent republication expanded on the scoring-based system by introducing quantifiable measures of accuracy through discriminant functions derived from logistic regression [7,8,9].

The incorporation of quantifiable statistics into the estimation of skeletal sex in Walker [5] was an important step taken to address a major perceived disadvantage of morphoscopic standards, namely the reduced reliability and subjectivity when scoring traits based on visual assessments in comparison to using metric analysis [10, 11]. The improved statistical presentation of this morphoscopic approach has further strengthened its popularity among practitioners in both routine medicolegal casework and disaster victim identification (DVI). The preferential use of morphoscopic assessments has long been based on the ease of use of morphoscopic standards, which, in comparison to metric analyses, do not require any specialised equipment and can be utilised with incomplete and fragmentary skeletal remains [4, 12].

Inter-population studies have assessed the applicability of the Walker [5] standard, with several reporting lower classification accuracies for females in a Greek population, at 53.57% [13], and also for females in an Italian population, at 78.6% [14], compared to 86.4% in Walker [5]. A South African study also reported reduced classification accuracies, with females classified below chance (i.e., < 50.0%) [15]. Japanese and Thai populations tested in Tallman [16] reported large misclassifications in both females and males, with accuracies as low as 26.3% and 30.2%, respectively, further emphasising the importance of utilising standards specific to their relevant population [12].

Clinical digital imaging, including computed tomography (CT) scans, has been validated for use with the Walker [5] standard. When tested on a Turkish population, Dereli et al. [17] reported classification accuracies between 83.3% and 86.1% for females and 92.9% and 100.0% for males by three observers. However, scorings for this study were done using Standards [6] without applying the regression functions in Walker [5]. When tested on a mixed US population group consisting of African, Asian, European, Latin, and Native Americans, Kelley and Tallman [18] reported classification accuracies between 84.2% and 92.1% for females and 77.1% and 93.2% for males, depending on the population-specific or pooled-population function derived from their study. These studies have allowed researchers to access more contemporary samples, which is important because secular variation is known to affect the accuracy of skeletal sex estimates [19]. They also demonstrated the viability of virtual samples as a valuable alternative to physical skeletal collections.

Although there appears to be growing interest in the development of standards for use in Asia more broadly, these developments are limited by access to skeletal collections [20,21,22]. Cultural and religious restrictions related to invasive autopsies and the handling of deceased individuals must be taken into consideration [23, 24]. Despite the relatively high frequency of natural and human-induced mass fatality events in Indonesia [25, 26], domestic standards that can be applied to fragmentary remains are lacking.

The rationale for including CT scanning in the research design is two-fold. First, it adheres to and respects the cultural and religious significance of the deceased. Second, it allows forensic practitioners to scan and preserve physical skeletal material in a 3D environment, which can then be sent to other practitioners, especially in instances of DVI, significantly improving the speed and likelihood of positive identification [27].

As such, the aims of the present study are to evaluate the accuracy of skeletal sex estimation using the Walker [5] standard and other Asian-derived predictive models in Tallman [16], and to develop forensically applicable predictive models for the estimation of skeletal sex optimised for the contemporary Indonesian population.

Materials and method

Study sample

A total of 200 multi-slice CT (MSCT) scans were analysed for this study, comprising 87 female and 113 male individuals. Ages ranged from 15 to 76 years (female: mean age = 43.6 years, SD = 13.4 years; male: mean age = 40.8 years, SD = 15.2 years). Figure 1 shows the age and sex distribution of the sample.

Fig. 1

Frequency distribution of the Indonesian sample by sex and age

The scans were obtained from the Picture Archiving and Communications System (PACS) database at the Dr Wahidin Sudirohusodo General Hospital (RSWS) within Hasanuddin University in Makassar, from patients who presented at the hospital for radiological examination as part of their normal course of treatment between February 2020 and August 2022. All CT scans were anonymised through PACS prior to receipt, except for the recorded age and sex of individual patients at the time the scans were taken.

Imaging was performed with a Siemens Healthineers SOMATOM go.Top 128-slice CT scanner, with resolutions between 0.6 and 1.5 mm (58.0% of all scans are 1.0 mm). Scans that showed signs of acquired or congenital pathology, or other abnormalities that would obscure the observation of traits in the skull, were excluded from this study and therefore not included in the total sample count.

Approvals were provided by the Human Ethics division of the Office of Research at the University of Western Australia (2021/ET000377) and the Office of the Director-General of Health Sciences from the Ministry of Health, Republic of Indonesia, through Hasanuddin University (LB.02.01/2.2/6807/2022).

Visualisation and assessment

Visualisation was performed using the “3D volume rendering” function in OsiriX® version 13.0.1. To orientate each scan to its respective views for assessment, both the “3D rotate” and “pan” functions were used; an example 3D visualisation is shown in Fig. 2. The “High Contrast” 3D preset was used in the volume rendering window. The colour look-up table (CLUT) was set to “VR Muscles-Bones”, and no convolution filters were applied. Following Walker [5], scores were then assigned for each of the five traits, including both sides of any bilateral traits: glabella (GLA), mastoid process (MAS), mental eminence (MEN), nuchal crest (NUC), and supraorbital margin (SUP). The virtual representations of the five assessed features are depicted in Fig. 3.

Fig. 2

3D visualisation of the skull in (a) left lateral and (b) anterior view

Fig. 3

The five cranial traits used in the Walker [5] standard visualised in a 3D environment: glabella (GLA), mastoid process (MAS), mental eminence (MEN), nuchal crest (NUC), and supraorbital margin (SUP)

Each cranial trait is scored individually on an ordinal five-point scale from 1 indicating “minimal expression” to 5 indicating “maximal expression.” To aid in scoring, visual illustrations and written descriptions are provided in Walker [5]. No associations with masculinity or femininity were noted for any of the scores in Walker [5], although such associations did appear in Standards [6]: traits scored as 1 or 2 were assigned as female; a score of 3 was ambiguous; and scores of 4 or 5 were male. The derivation of composite scores in Standards [6] assumed that all cranial traits were weighted equally. Given that the latter assumption is unlikely to be biologically robust, traits were weighted according to their individual discriminatory power in Walker [5].

Statistical analysis

All statistical analyses were performed on IBM SPSS Statistics version 29.0.0.

Intra-observer agreement

A subset of 50 MSCT scans was used to quantify intra-observer agreement: 24 female (mean age = 44.3 years, SD = 13.6 years) and 26 male (mean age = 44.8 years, SD = 16.0 years). These scans were assessed by the lead author a total of three times; repeat assessments were performed with an interval of at least 24 h. For all bilateral traits, only the left side was scored. Agreement was calculated using the intraclass correlation (ICC) statistic, with values interpreted as follows: poor (ICC < 0.50), moderate (0.50 ≤ ICC ≤ 0.75), good (0.75 < ICC ≤ 0.90), and excellent (ICC > 0.90) [28].
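The interpretation bands above can be expressed as a simple lookup. The following is a minimal Python sketch for illustration only; the function name `interpret_icc` and the example values are hypothetical and are not part of the SPSS workflow used in this study.

```python
def interpret_icc(icc: float) -> str:
    """Map an ICC estimate to the agreement category quoted in the text."""
    if icc < 0.50:
        return "poor"
    if icc <= 0.75:
        return "moderate"   # 0.50 <= ICC <= 0.75
    if icc <= 0.90:
        return "good"       # 0.75 < ICC <= 0.90
    return "excellent"      # ICC > 0.90


# Hypothetical ICC estimates, one per band.
for value in (0.45, 0.70, 0.80, 0.95):
    print(value, interpret_icc(value))
```

The boundary values 0.75 and 0.90 are assigned to the lower of the two adjacent bands, following the inequalities given above.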

Bilateral asymmetry

Previous studies utilising the Walker [5] standard did not include assessments of bilateral asymmetry (e.g., [15, 18]), implying that these authors followed scoring procedures from the original publication, or that the left side was used unless damaged or missing (e.g., [8, 16]). The present study included assessments of bilateral traits to compare the differences in scoring between sides and their potential effects in estimating skeletal sex. Asymmetry was assessed using a Wilcoxon signed-rank test (Z).

Trait score distributions

Trait score frequency distributions are calculated and sorted by sex, with differences in sex-based distributions assessed using the Mann–Whitney U test. Sex classification accuracies based on trait frequency distributions are derived from the probability of obtaining a specific score for a single trait. This probability is calculated from the proportion of each sex assigned that score. In this instance, the probability of an individual being classified as female is calculated using the following equation:

$$pf=\frac{\%\,\mathrm{female}}{\%\,\mathrm{female}+\%\,\mathrm{male}}$$

where % female and % male are the proportions of females and males assigned a specific score for the trait being assessed. The probability of an individual being classified as male would be calculated based on the following equation:

$$pm=1-pf$$

If pf > 0.50, the individual is likely female; if pm > 0.50, the individual is likely male.
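As a worked illustration of the two equations above, the following minimal Python sketch computes pf and pm for a single trait score. The function names and the input percentages are hypothetical and are not taken from Table 1; the "indeterminate" outcome for a tie at 0.50 is an assumption added for completeness.

```python
from typing import Tuple


def sex_probabilities(pct_female: float, pct_male: float) -> Tuple[float, float]:
    """Return (pf, pm) for one trait score.

    pct_female and pct_male are the percentages of females and males
    assigned that score for the trait being assessed.
    """
    total = pct_female + pct_male
    if total == 0:
        raise ValueError("score was not observed in either sex")
    pf = pct_female / total
    pm = 1.0 - pf
    return pf, pm


def likely_sex(pf: float) -> str:
    """Apply the 0.50 threshold; a tie is treated as indeterminate (assumption)."""
    if pf > 0.50:
        return "female"
    if pf < 0.50:
        return "male"
    return "indeterminate"


# Hypothetical example: 30.0% of females and 10.0% of males received this score.
pf, pm = sex_probabilities(30.0, 10.0)
print(pf, pm, likely_sex(pf))
```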

Validation of Walker (2008) and Tallman (2019) functions

Nine multivariate functions from Walker [5] are applied to the Indonesian sample to assess the validity of this standard, which was developed on European and Native American populations. Moreover, seventeen functions incorporating the Japanese, Thai, and a pooled population group in Tallman [16] are applied to analyse the performance of Asian-derived predictive models and their suitability for forensic application in Indonesia. Differences in performance are assessed by comparing classification accuracies and sex bias values of each function. Sex bias values are calculated as the female classification accuracy minus the male classification accuracy; positive values indicate a bias towards classifying individuals as female, while negative values indicate a bias towards classifying individuals as male.
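The sex bias calculation can be sketched as follows. This is an illustrative Python example only; the function names and the counts used are hypothetical, and the sign convention (female minus male) follows the definition given above.

```python
def classification_accuracy(correct: int, total: int) -> float:
    """Percentage of individuals of one sex that were classified correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def sex_bias(female_accuracy: float, male_accuracy: float) -> float:
    """Female accuracy minus male accuracy, in percentage points.

    Negative values mean females are misclassified more often than males
    (a bias towards 'male'); positive values mean the reverse.
    """
    return female_accuracy - male_accuracy


# Hypothetical counts: 40 of 50 females and 45 of 50 males correctly classified.
female_acc = classification_accuracy(40, 50)
male_acc = classification_accuracy(45, 50)
print(female_acc, male_acc, sex_bias(female_acc, male_acc))
```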

Univariate and multivariate models for the Indonesian population

Univariate predictive models include only one trait as the independent variable, while multivariate predictive models incorporate various trait combinations based on the absence of one or more traits if the skull were fragmented (e.g., missing mandible, and therefore no mental eminence to score). To derive and test univariate and multivariate predictive models specific to an Indonesian population, 75% of the sample is randomly selected for training, with the remaining 25% used as a hold-out for validation. Sex is coded as 0 for females and 1 for males. Binary logistic regression (BLR) is utilised to derive these predictive models. Sex-specific and total classification accuracies, posterior probabilities, and sex bias values are calculated based on the trait combinations used; training and validation accuracies are provided.
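The general procedure can be sketched in Python as a random 75/25 split followed by the application of a fitted logistic function. This is a minimal sketch under stated assumptions and not the SPSS procedure used in this study: the split is a simple unstratified shuffle, the function names are hypothetical, and the intercept and coefficients are invented for illustration and are not the values reported in Table 6.

```python
import math
import random
from typing import Dict, Iterable, List, Tuple


def split_sample(ids: Iterable[int], train_fraction: float = 0.75,
                 seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly split case identifiers into training and hold-out subsets."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def probability_male(scores: Dict[str, int], intercept: float,
                     coefficients: Dict[str, float]) -> float:
    """Posterior probability of 'male' (coded 1) from a fitted BLR model."""
    logit = intercept + sum(coefficients[t] * scores[t] for t in coefficients)
    return 1.0 / (1.0 + math.exp(-logit))


def classify(p_male: float) -> int:
    """Return 1 (male) if the probability exceeds 0.50, otherwise 0 (female)."""
    return 1 if p_male > 0.50 else 0


# 200 hypothetical case identifiers split 75/25.
train_ids, holdout_ids = split_sample(range(200))
print(len(train_ids), len(holdout_ids))

# Hypothetical two-trait model (NOT the coefficients from this study).
model = {"GLA": 1.2, "MAS": 0.8}
p = probability_male({"GLA": 4, "MAS": 3}, intercept=-6.0, coefficients=model)
print(round(p, 3), classify(p))
```

In practice, the coefficients would be estimated from the training subset only, and the hold-out subset would be used solely to compute validation accuracies and sex bias.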

Results

Intra-observer agreement

The ICC estimates and 95% confidence intervals (CI) for intra-observer agreement on each trait were based on a single-rating, absolute agreement, 2-way mixed-effect model. The glabella had excellent agreement, ICC = 0.926, 95% CI [0.885, 0.954]. The nuchal crest, ICC = 0.863, 95% CI [0.792, 0.914], mastoid process, ICC = 0.798, 95% CI [0.701, 0.872], and supraorbital margin, ICC = 0.762, 95% CI [0.653, 0.848], had good agreement. The mental eminence had moderate agreement, ICC = 0.730, 95% CI [0.610, 0.825].

Bilateral asymmetry

Wilcoxon signed-rank tests reported no significant differences in the assignment of scores for the mastoid process, Z = –1.26, p = 0.208, and supraorbital margin, Z = –1.40, p = 0.161. Consequently, scores recorded from the left side only were used for all subsequent statistical analyses of these traits.

Trait score distributions

Sex-based variations in trait score distribution frequencies were significantly different for all traits: glabella, U = 10.17, p < 0.001; mastoid process, U = 8.02, p < 0.001; mental eminence, U = 5.81, p < 0.001; nuchal crest, U = 5.83, p < 0.001; supraorbital margin, U = 3.99, p < 0.001. Cranial trait score distributions for the Indonesian population are provided in Table 1. All score expressions were recorded for each trait, except for the mental eminence, for which no female had a recorded score of 5. In general, lower trait scores are associated with females, while higher trait scores are associated with males. Males were also observed to have a wider range in cranial trait expression compared to females.

Table 1 Frequency distribution of cranial trait scores and their proportions sorted by sex

Probability values associated with each character state per trait are presented in Table 2. Based on these probability values, scores of 1 and 2 for the glabella, mastoid process, and mental eminence represented female individuals, while scores from 3 to 5 represented male individuals. By comparison, a nuchal crest score of 1 represented female individuals, and scores from 2 to 5 represented male individuals. In contrast, supraorbital margin scores from 1 to 3 represented female individuals, and scores from 4 to 5 represented male individuals.

Table 2 Probabilities associated with sex classification based on assigned trait score. See Table 1 for trait score frequency distributions

Table 3 summarises differences in cranial trait score distributions between the Indonesian population and the five population groups in Walker [5] and Tallman [16]. When compared against the American/English and Native American populations in Walker [5], the mental eminence in the Indonesian population was noted to be more gracile than in the Native American population group, while the nuchal crest was more gracile than in both the American/English and Native American population groups. In contrast, the supraorbital margin was more robust in the Indonesian population compared to the two population groups in Walker [5].

Table 3 Differences in cranial trait score distributions based on probability of sex classification sorted by trait, with population groups from Walker [5] and Tallman [16]

Validation of Walker (2008) and Tallman (2019) functions

The nine functions (W1 to W9) provided in Walker [5] were applied to the Indonesian population as reported in Table 4. Performance varied depending on the function used. The most accurate function was W1, originally developed for an American/English population, with classification accuracies of 87.4% in Indonesian females and 88.5% in Indonesian males, compared to 86.4% and 88.4% as reported by Walker [5], respectively. Function W5, also developed for an American/English population, had the lowest classification accuracy for Indonesian females at 44.8% and the highest classification accuracy for Indonesian males at 92.0%. The sex bias values in the Indonesian population ranged from –47.2% in Function W5 to 27.4% in Function W8, while Walker [5] reported sex bias values from –4.5% to 13.4%. All American/English functions (W1 to W6) misclassified females (i.e., negative sex bias). In contrast, all Native American functions (W7 to W9) misclassified males (i.e., positive sex bias).

Table 4 Classification accuracies and sex biases of the nine Walker [5] functions, applied to the Indonesian population

Seventeen functions (Japanese: J1 to J5; Thai: T1 to T5; and pooled Japanese/Thai: P1 to P7) in Tallman [16] were also applied to the Indonesian population and their results are detailed in Table 5. The most accurate function was T2, with classification accuracies of 72.4% in Indonesian females and 93.8% in Indonesian males, compared to 76.3% and 89.6% in Tallman [16], respectively. The least accurate functions were T1 and P7, with classification accuracies of 50.6% in Indonesian females and 99.1% in Indonesian males, compared to 86.7% and 84.1% in Function T1, and 72.7% and 91.7% in Function P7, respectively. All functions had overly large negative sex bias values that were greater than those reported in the original study.

Table 5 Classification accuracies and sex biases of the seventeen Tallman [16] functions, applied to the Indonesian population
Table 6 Indonesia-specific functions and their associated training and validation classification accuracies, along with sex bias

Univariate Indonesian-specific predictive models

Five univariate predictive models (S1 to S5), one derived for each cranial trait, are detailed in Table 6. Function S1 (glabella) had the highest classification accuracy in the training sample, with 89.4% for females and 85.7% for males. Function S3 (mental eminence) had the lowest classification accuracy in the training sample at 62.1% for females. In the validation (i.e., hold-out) sample, Function S1 had the highest classification accuracy of 90.5% for females and 75.9% for males, while Function S5 (supraorbital margin) had the lowest accuracy of 57.2% for females and 41.4% for males. The sex bias values in the training sample ranged from –9.3% in Function S3 (mental eminence) to 3.7% in Function S1 (glabella). For the validation sample, the sex bias was lowest at –7.9% for Function S3 (mental eminence) and largest at 18.9% for Function S2 (mastoid process).

Multivariate Indonesian-specific predictive models

Nine multivariate predictive models (M1 to M9) are also detailed in Table 6. Function M9, which incorporated all traits, had the highest training classification accuracy at 89.4% for females and 89.3% for males. Function M2 (mastoid process and supraorbital margin) had the lowest training classification accuracy at 77.3% for females and 85.7% for males. In the validation sample, Functions M6 (glabella, mastoid process, and mental eminence) and M9 had the highest classification accuracies at 95.2% for females and 82.8% for males. Functions M2 and M5 (mastoid process, nuchal crest, and supraorbital margin) had the lowest classification accuracies at 61.9% for females and 65.5% for males. The best performing function overall (i.e., high classification accuracy and low sex bias) was M8, which included all traits except the nuchal crest and had validation classification accuracies of 85.7% for females and 82.8% for males. Sex bias values for the training sample ranged from –9.3% in Function M5 to 3.7% in Function M1. For the validation sample, sex bias was smallest at –3.6% in Functions M2 and M5 and largest at 14.6% in Function M1.

Discussion

The aim of the present study was to evaluate the effectiveness of the Walker [5] and Tallman [16] BLR functions in a contemporary Indonesian population. Based on the results presented, classification accuracies were lower than those of the original publications and demonstrated the need for population-specific models to ensure high classification accuracies are maintained. Fourteen functions derived from the Indonesian population provide classification accuracies that are comparable to those found in Walker [5] and Tallman [16]. These functions would therefore serve to improve the capabilities of forensic practitioners in Indonesia for both routine medicolegal casework and DVI.

Observer agreement

Intra-observer agreement for all five traits was moderate to excellent, with similar results reported in other studies [5, 9, 16]. As with any morphoscopic standard, visual illustrations and written descriptions should be clear and concise to minimise ambiguity in trait scoring [29]. Some descriptors in Walker [5] involve tactile interactions with the feature (i.e., mental eminence and supraorbital margin). Inter-observer agreement was not considered for this study as the Walker [5] standard is amongst the most widely used by forensic practitioners [4]. There is thus considerable published data showing acceptable levels of inter-observer agreement when applied in both physical [9, 13,14,15] and digital modalities [17].

With the use of digital 3D volume-rendered CT scans, this tactile approach is not feasible and would require modifying descriptors to more appropriately reflect the visual observations made. Dereli et al. [17] likewise noted the lack of tactile interactions and suggested incorporating other identifying features as an alternative. To address these issues, instructions for the examination of cranial traits in a digital 3D environment were provided in the software package MorphoPASSE developed by Klales [30] to aid observers in scoring.

The mental eminence appears to be a contentious trait in skeletal sex estimation due to its low observer agreement. Several studies report the mental eminence having the lowest observer agreement [9, 31]. As such, these authors have suggested the mental eminence not be considered. However, other studies have demonstrated more favourable observer agreements for the mental eminence (e.g., [16,17,18]), including the present study. In relation to the latter, the orientation of the mental eminence within the 3D environment in lieu of tactile feedback allowed for more consistent scoring (see Fig. 3).

Variation in trait score distribution between population groups

As detailed in Table 3, when assessing the trait score distributions from the Indonesian population against the Japanese, Filipino, and Thai population groups in Tallman [16], the glabella and supraorbital margin were more robust in the Indonesian population, while the mastoid process and nuchal crest were more gracile. Similar trait score distributions were observed only for the mental eminence. These observed variations support other studies that highlight the importance of developing and utilising population-specific standards, since different population groups display varying degrees of sex-based differences even if they originate from the same continent [8, 16].

Applying the nine functions of Walker [5] to the Indonesian sample expectedly resulted in different classification accuracies to the original, with differences ranging from –33.1% to 5.6% in females and from –23.2% to 14.6% in males, depending on the function used. All American/English functions had larger male sex bias values, while the Native American functions had larger female sex bias values. Similarly, testing the Japanese, Thai, and pooled-population functions by Tallman [16] resulted in classification accuracies with similarly large differences from –36.1% to –3.9% for females and from 4.2% to 23.2% in males. All functions had larger female sex biases, regardless of the population group being tested, including the pooled population.

It is evident on the basis of cranial trait distributions in the Indonesian population compared to the Filipino, Japanese, and Thai populations, that there are marked distribution differences in four of the five traits. These results show the variation in cranial trait expression across population groups, as suggested by other similar studies [9, 15, 16]. Such variations can be attributed to a variety of intrinsic factors such as diet and socioeconomic status [32], and extrinsic factors, including geographic locale and the environment [33].

Indonesian-specific predictive models

A total of 14 predictive models, five univariate and nine multivariate, were derived based on the Indonesian training sample. For univariate functions, the glabella performed best in the training subset, with classification accuracies at 89.4% for females, 85.7% for males, and a sex bias of 3.7%. It was also the best performing in the validation subset, with classification accuracies at 90.5% for females and 75.9% for males. However, its 14.6% sex bias falls outside acceptable limits (i.e., its absolute value exceeds 5.0%). All univariate models in this study had total classification accuracies below 85.0% and sex biases outside of acceptable limits. Therefore, use of these models in forensic practice is not recommended.

Multivariate functions had training classification accuracies from 77.3% to 89.4% for females, 85.7% to 89.3% for males, and sex biases from –9.3% to 3.7%, depending on the combination of traits used. Likewise, validation classification accuracies ranged from 61.9% to 95.2% for females, 65.5% to 82.8% for males, and sex biases from –3.6% to 14.6%. The best performing trait combination on the training subset, Function M9, included all traits as variables, with classification accuracies at 89.4% for females, 89.3% for males, and a sex bias of 0.1%. The best performing trait combination on the validation subset, Function M8, included all traits except the nuchal crest as variables, with classification accuracies at 85.7% for females, 82.8% for males, and a sex bias of 2.9%. The sex biases for both of these functions are within acceptable limits.

The selection of multivariate functions is dependent on the availability of traits for scoring. As the likelihood of recovering fragmentary cranial elements is high, the trait combinations presented in this study are region-specific. For example, Function M7 would be appropriate if only the mandible were missing, and therefore the mental eminence cannot be scored. Likewise, Function M1 would be appropriate when only the frontal bone is recovered, such that only the glabella and supraorbital margin can be scored.

Figure 4 visualises the differences in classification accuracies when Function W1 in Walker [5], Function P6 in Tallman [16], and Function M4 derived from the present study sample are applied to the Indonesian sample. All three functions include the same traits as variables: the glabella, mastoid process, and mental eminence. The comparison demonstrates the improvements in classification accuracy when population-specific functions are applied, which further supports the need for predictive models developed for the population group being assessed.

Fig. 4

Sex-specific classification accuracies derived from functions with predictors glabella, mastoid process, and mental eminence presented in Walker [5], Tallman [16], and this study when applied to the present Indonesian sample

Study limitations

As Indonesia is an island archipelago with a long and rich history, it is important to acknowledge the movement of different subpopulations both into and within the country [34, 35]. The application of the presented models to the broader Indonesian population must be tested. As is the case in the evaluation of cranial sex estimation standards in other large and geographically disparate populations in countries like Australia [36, 37], it is likely that adjustments to these models will be needed when larger samples from different regions of Indonesia become available for analysis in the future.

Conclusion

This study has validated the use of the Walker [5] standard for sex estimation in a sample of Indonesian cranial CT scans. By testing the application of BLR functions from the original Walker [5] standard and from other Asian-derived populations outside of Indonesia, this study demonstrated their limitations when applied to the Indonesian sample. Consequently, the development of Indonesian-specific models improved classification accuracies for use in modern forensic practice. Although there are regional studies that have focused on the utility of CT scans for metric data (e.g., [36]), the inclusion of morphoscopic standards into Indonesia’s forensic anthropological literature further expands the toolkit available to forensic practitioners in that jurisdiction. This will improve their capabilities, providing access to a readily deployable standard for skeletal sex estimation that does not require expensive specialised equipment.