Machine learning and discriminant function analysis in the formulation of generic models for sex prediction using patella measurements

Sex prediction from bone measurements that display sexual dimorphism is one of the most important aspects of forensic anthropology. Some bones like the skull and pelvis display distinct morphological traits that are based on shape. These morphological traits which are sexually dimorphic across different population groups have been shown to provide an acceptably high degree of accuracy in the prediction of sex. A sample of 100 patella of Mixed Ancestry South Africans (MASA) was collected from the Dart collection. Six parameters: maximum height (maxh), maximum breadth (maxw), maximum thickness (maxt), the height of articular facet (haf), lateral articular facet breadth (lafb), and medial articular facet breath (mafb) were used in this study. Stepwise and direct discriminant function analyses were performed for measurements that exhibited significant differences between male and female mean measurements, and the “leave-one-out” approach was used for validation. Moreover, we have used eight classical machine learning techniques along with feature ranking techniques to identify the best feature combinations for sex prediction. A stacking machine learning technique was trained and validated to classify the sex of the subject. Here, we have used the top performing three ML classifiers as base learners and the predictions of these models were used as inputs to different machine learning classifiers as meta learners to make the final decision. The measurements of the patella of South Africans are sexually dimorphic and this observation is consistent with previous studies on the patella of different countries. The range of average accuracies obtained for pooled multivariate discriminant function equations is 81.9–84.2%, while the stacking ML technique provides 90.8% accuracy which compares well with those presented for previous studies in other parts of the world. In conclusion, the models proposed in this study from measurements of the patella of different population groups in South Africa are useful resent with reasonably high average accuracies.


Introduction
Prediction of sex from recovered or discovered bones in human identification is an important first step taken by forensic anthropologists to reduce the number of possible matches by 50% [1]. This process, in conjunction with the estimation of age, stature and population affinity, is essential in the establishment of the identity of an individual from skeletons. Some bones like the skull and pelvis display distinct morphological traits that are based on shape. These morphological traits which are sexually dimorphic across different population groups have been shown to provide an acceptably high degree of accuracy in the estimation of sex [2]. While most earlier researchers and some lately [3] have focused on the use of description of the observed morphological traits, considered to be a subjective method which requires many years of experience, attention has been shifted recently to the quantification of the differences in shape that are observed on bones. These quantifications can be performed objectively using various morphometric techniques such as including and not limited to geometric morphometrics [4][5][6][7][8][9][10][11].
In the absence of the pelvis and the skull which display obvious morphological differences between males and females, size differences which are present in most bones of the postcranial skeleton can also be used for sex prediction. This metrical approach can also be used on incomplete or fragmentary remains. Standard measured parameters of different bones of the skeleton which can be easily reproducible have been analyzed through the use of various statistical methods including and not limited to logistic regression and discriminant function in different population groups. It is a well-established fact that these equations are populationspecific and as such should be limited in their application to only population groups for which they were formulated to obtain acceptably high classification average accuracies. This has led to the generation of population-specific discriminant function and logistic regression equations for measurements of the skull [12][13][14][15][16][17], bones of the vertebral column [18,19], pelvis [20][21][22], long bones of the upper [23][24][25][26][27][28][29] and lower extremities [30][31][32][33][34][35][36], and hand and foot bones [37][38][39][40] in different population groups of the world with acceptably high classification rates.
Similar population-specific local standards have also been established in South Africa for the prediction of sex from dimensions of the skull [37,38]) and postcranial bones [39][40][41][42][43][44][45][46]. Most of these equations have been derived from data collected from samples of bones of South Africans of European and African descent, which are housed mainly in the Raymond Dart collection of human skeletons [47], Pretoria bone collection [48], and UCT osteological collection [49]. Recently, successful attempts have also been made to formulate population-specific equations for Mixed-Ancestry South Africans or colored [43,[50][51][52]. While discriminant function analysis has been widely used in sex estimation using various bones of the human skeleton in South Africa, no previous attempts have utilized other novel techniques such as machine learning algorithm for that purpose.
Over the past decade, machine learning (ML) algorithms have become increasingly integrated into clinical predictive modeling, e.g., in prognostic models using health data [53][54][55]. Recent reviews have also highlighted the high interest in ML approaches for clinical guidance, as well as the necessity for more prognostic studies [56]. While there is a significant rise in interest in ML in health care, only a few studies have evaluated its capability of outperforming conventional statistical models (CSMs) in terms of predictability or not. ML rapidly examines continuously expanding datasets and enables the identification of patterns and trends that may not be directly visible to clinicians [57]. Other advantages of ML are its flexibility, it is nonparametric, requires no data model for the probability distribution of the outcome variable, requires no pre-specification of covariates, and it can process large input variables simultaneously [58,59]. In a clinical context of predicting mortality from gastrointestinal bleeding, a systematic review demonstrated higher c-indices and predictive capacity of ML than clinical risk scores [60]. Another study aimed at predicting bleeding risk following percutaneous coronary intervention found that ML characterized bleeding risk better than a standard discriminant analysis model [61]. Likewise, ML vs CSMs using the TOPCAT trial dataset showed that ML methods presented higher c-indices than CSMs for readmission (0.76 vs 0.73) and predicting mortality (0.72 vs 0.66) [62].
Osteometric variations between population groups have necessitated the need to propose population specificity standards for human identification. In addition, each of these groups exhibits and expresses sexual dimorphism to various degrees. Some authors have observed major flaws in the development and application of population-specific standards for the prediction of sex [63] and stature [64,65] leading to the proposal and recommendation for use of generic equations. This study thus aims to (1) formulate generic models for sex prediction using measurements of the patella of South Africans of African (SAAD) and European descent (SAED), as well as Mixed Ancestry South Africans (MASA) (2), and compare the classification rates obtained from the generic models using linear discriminant analysis with those obtained from machine learning algorithms.

Materials and methods
The Human Research Ethics Committee (Medical) of the University of the Witwatersrand, Johannesburg, South Africa granted an ethical clearance waiver (Ethics Waiver Number: W-CJ-140604-1) before the commencement of this study. Data were collected from a sample of patella of Mixed Ancestry South Africans (MASA). Additional data analyzed in the current study were obtained from previously published studies on sexual dimorphism of measurements of patella of South Africans of European descent (SAED) [46] and South Africans of African descent (SAAD) [66]. The sample distribution for the data used in the current study is as follows: SAAD (50 males and 50 females), SAED (50 males and 50 females), and MASA (30 males and 30 females). The birth dates range from 1999 to 2017. The source of data was the Raymond A. Dart collection of human skeletons, considered one of the largest collections of human skeletons in the world [47]. It is located in the School of Anatomical Sciences of the University of the Witwatersrand, Johannesburg, South Africa. The patella belonged to individuals whose age at death ranged between 25 and 79 years and whose birth years were between 1928 and 1991. Patella with any pathological features like osteophytic lipping, lesions, or any other obvious deformities were excluded from this study. Six parameters were measured on each patella. These are maximum height (maxh), maximum breadth (maxw), maximum thickness (maxt), the height of articular facet (haf), lateral articular facet breadth (lafb), and medial articular facet breath (mafb). These measurements have been described in previous studies [46] and are illustrated in Fig. 1. Lin's [67] concordance correlation coefficient of reproducibility was used for the assessment of intraobserver error. It has been shown that this method assesses the agreement between the test and retest measurement and is considered as a measure of prevision of the measuring technique.

Statistical analysis
Statistical analysis was performed using the Stata/MP 13.0 software. SPSS version 23 software program was used for the linear discriminant analysis of data. Sex differences were described using numbers and percentages. The number of missing data, mean, standard deviations, median, and quartiles (Q1, Q3) for combined data for all measurements from SAED, SAAD, and MASA were calculated separately for each sex. In univariate analysis, the Rank sum tests were used and performed for all variables. A statistically significant difference was defined as a P value < 0.05.

Discriminant function analysis
Stepwise and direct discriminant function analyses were performed for measurements that exhibited significant sex differences. The "leave-one-out" classification procedure was then used to evaluate the validity of the functions. In this procedure, each case in the sample is classified using the function that is generated without it. Then, generic stepwise and direct discriminant functions with acceptably high average accuracies were selected. Each of the generic functions selected was used to predict sex for each case in samples of SAED, SAAD, and MASA. The average accuracies in correct sex prediction for each of the functions were calculated for each population group separately.

Machine learning-based analysis
Six patella measurements were present in the dataset that were evaluated to determine the Pearson correlation among them. Figure 2 shows the heatmap of correlation, and it was found that none of them is highly correlated to the other. A maximum correlation of 0.81 was found between maxb and lafb. However, the threshold of removing highly correlated features was considered r > 0.85 and therefore, none of the features was removed for the next phase of the investigation.

Data normalization
The accuracy of the machine learning models is dependent on the quality of the input data for achieving generalized performance. This involves data normalization that entails  scaling or transforming the data to make each selection contribute equally during the training process. The performance enhancement of the machine learning models employing such has been verified by many studies [68]. In this study, Z-score normalization was utilized due to its sensitivity to outliers. The formula for Z-score normalization as shown in Eq. (1) is: where v ′ , v , v , and v denote the new value, original value, mean, and standard deviation of the variable values in the training samples, respectively. This method transforms the data with a mean of 0 and a standard deviation of 1.

Top-ranked features identification
The feature selection technique automatically selects those features which are most significant for output prediction. This method thus helps in reducing overfitting and training time as well as improving accuracy. Several different feature selection techniques, e.g., univariate selection, recursive feature elimination (RFE), principal component analysis (PCA), bagged decision trees like random forest and extra trees, and boosted trees like Extreme Gradient Boosting (XGBoost) etc. have been used in the literature. However, the present study investigated and compared three feature selection techniques: (1) XGBoost [69], (2) Extra tree [70], and (3) Random Forest [71,72] to determine the best feature combinations for sex prediction using different ML classifiers.

Model development
The present used and compared different machine learning classifiers such as Gradient boosting [69], XGBoost [73], Extra tree [73], K-nearest neighbour (KNN) [73], Adaboost [73], Random Forest [73], linear discriminant analysis (LDA) [71,72], and Logistic regression [74] using the best feature combination which was identified by the feature selection techniques for sex prediction. Then we investigated a stacking approach where a combination of base learners and meta learners was used to classify the sex of the subject. Here, we have used the top performing three ML classifiers as base learners and the predictions of these models were used as inputs to different machine learning classifiers as meta learners to make the final decision. Eight different machine learning classifiers as meta learners in the stacking approach to find the best performing classifier were investigated. If a single dataset A, which consists of feature vectors ( x i ) and their classification probability score is y i . At first, a set of base-level classifiers M 1 , … … , M p is generated and the outputs are used to train the meta-level classifier as illustrated in Fig. 3.
Five-fold cross-validation was used to create a training set for the meta-level classifier. Among these folds, base-level classifiers were trained on four-folds, leaving one fold for validation. Each base-level classifier predicts a probability distribution over the possible class values. Thus, using input x, a probability distribution is created using the predictions of the base-level classifier set, M: where (c 1 , c 2 , … … , c n ) is the set of possible class values and P M c i |x denotes the probability that example x belongs to a class c j as estimated (and predicted) by classifier M in Eq. 2. The class, c i with the highest-class probability, P M j c i |x is predicted by the classifier, M. The meta-level classifier M f and attributes are thus the probabilities predicted for each possible class by each of the base-level classifiers, i.e., P M j c i |x for i = 1,…., n, and j = 1,…., p. The pseudo-code for the stacking approach is shown in Algorithm 1.

Performance metrics
Different classification models were compared using the topranked features from the testing data to calculate the performance matrices in classifying male and female classes. The best performing classifier was evaluated for different combinations of features as input to the model by calculating the receiver operating characteristic (ROC)-area under the curve (AUC) and performance metrics such as accuracy, precision, sensitivity, specificity, and F1-Score as shown in Eqs. (3)(4)(5)(6)(7). Different classification algorithms and different features' combinations of the best performing algorithm were validated using fivefold cross-validation where training and testing were done on 80% and 20% of data, respectively, and this process was repeated 5-times to test the entire dataset. Weighted average within 95% confidence interval was calculated for sensitivity, specificity, precision, F1-score, and overall accuracy from the confusion matrix that accumulates all test (unseen) fold results of the fivefold cross-validation. The correct estimation of a male subject is true positive (TP), and the correct estimation of the female subject is true negative (TN). The incorrect estimation of the male subject as female is false negative (FN) and the incorrect estimation of the female subject as male is false positive (FP)

Results
The values of Lin's concordance correlation coefficient of reproducibility ranges between 0.974 and 0.998 (Table 1). These values fell within the recommended range from 0.90 to 0.99 which indicates that all patella measurements are easily reproducible and the subsequent data analyzed in this study are not significantly affected by measurement error. For clarity, the analyses on discriminant function and machine learning are presented separately. In the first section, results from descriptive statistics, univariate and multivariant discriminant function analysis are presented while in the second section, best feature selection, validation of ML model and stacking technique are reported.

Discriminant function analysis
The descriptive statistics of all measurements for pooled data are shown in Table 2. The male showed significantly higher (p ≤ 0.05) mean measurements for all measures than the female. All patella measurements were subjected to stepwise and direct discriminant function analyses. The unstandardized coefficients, constants, average accuracies, cross-validation in correct sex classification, and the sectioning points for individual measurements are shown in Table 3. The best performing variable, maxh, presented with an acceptably high average accuracy of 82% (Table 3) while the other variables presented with low average accuracies which ranged between 69 for lafb and 79% for maxb (Table 3). Table 4 shows the stepwise and direct discriminant function analysis using various combinations of measurements. In the stepwise analysis, four measurements were selected, namely, maxh, maxb, maxt, and haf. The discriminant function equation derived from these measurements provided an average accuracy of 84.2% as shown in Table 4. The other functions in Table 4 were formulated using direct discriminant function analysis of patella measurements. The average accuracies in correct sex classification ranged between 81.9 (Function D5, Table 4) and 83.5% (Function D1, Table 4). The results of the cross-validation using the leave-one-out classification showed that the average accuracy in correct sex classification for most of the presented functions remained unchanged (Table 4). Functions D2 and D4 showed a minimal and insignificant drop in classification rate of 0.8% thereby confirming the validity of the derived functions from the pooled data.

Best feature combination for sex prediction
In this study, three feature ranking algorithms were used to identify top-ranked features among all features. These top-ranked features were investigated with 8 different classifiers which were performed with Top-1 to Top-6 features to identify the best performing classification model and best feature combination simultaneously for sex prediction. It was observed that RF and ET feature selection techniques produced the same feature ranking while the XGBoost feature selection algorithm produced different rankings as shown in Fig. 4. In this study, Top-3 features (maxh, maxb, and maxt) using a random forest (RF) feature selection algorithm with random forest machine learning (ML) classifier outperformed other classifiers. Table 5 shows the overall accuracies and weighted average performance for the other matrices (precision, sensitivity, specificity, and F1-score) Fig. 4 Top-ranked features using different feature selection techniques; A XGBoost, B random forest, and C Extra Tree algorithms with a 95% confidence interval to identify the best feature combinations using Top 1 to 6 features for fivefold crossvalidation using the best classifier (AdaBoost classifier for XGBoost feature selection and RF classifier for RF and ET feature selection algorithms). It is clearly seen that the Top-3 features (maxh, maxb, and maxt) from RF and ET feature selection techniques produced the best performance of overall accuracy, and weighted precision, sensitivity, specificity, and F1-score of 89.61%, 89.67%, 89.62%, 89.62%, and 89.61%, respectively, using RF classifier for sex prediction. It was noticed that six features were required in the case of the XGBoost feature selection technique to produce the best performance of overall accuracy, weighted precision, sensitivity, specificity, and F1-score of 89.23%, 89.31%, 89.23%, 89.24%, and 89.22%, respectively, using AdaBoost classifier, whereas similar performance was produced by only three features from RF and ET feature selection techniques with RF classifier.

Development and validation of different ML and stacking models
We investigated the best combination of three features (maxh, maxb, and maxt) and selected the best ML classifiers among eight classifiers as base models and trained different ML classifiers as meta-learners. We selected top two models (RF and ET) where the overall accuracies, and weighted precision, sensitivity, specificity, and F1-score were 89.23%, 88.64%, 90.00%, 88.46%, 89.31%, and 85.34%, 85.27%, 85.03%, 85.45%, 85.14%, respectively ( Table 6). The stacking approach was trained with RF and ET classifiers as a base learner and Gradient Boosting classifier as meta learner outperformed other meta learner classifiers with the performance of overall accuracy, and weighted precision, sensitivity, specificity, and F1-score of 90.77%, 89.55%, 92.3%, 89.23%, and 90.9%, respectively. Figure 5A shows the confusion matrix of the best performing ML classifier (RF classifier), and Fig. 5B shows the confusion matrix of the best performing stacking model (with Gradient Boosting classifiers as a meta learner). It can be noticed that even with the best performing RF classifier, 13 out of 130 male subjects were miss-classified as female and 15 out of 130 female subjects were miss-classified as male when the stacking model with Gradient Boosting classifier as a meta learner outperformed other ML classifiers, where 120 out of 130 male subjects were correctly classified as male and 116 out of 130 females were correctly identified as a female with the stacking model. Thus, the stacking model outperformed other state-of-the-art ML classifiers. Figure 6 shows the AUC) /ROC curve (also known as AUROC (area under the receiver operating characteristics)) for sex identification using different ML classifiers, which is one of the most important evaluation metrics for checking any classification model's performance. It is apparent that the stacking model outperformed other ML classifiers for classification with 92.65% AUC (Fig. 6).

Discussion
Sex prediction from measurements of bones that display sexual dimorphism is an important aspect of forensic anthropology. Population-specific standards which are generally considered to provide the best estimation of sex have been published for the skull and postcranial elements in different parts of the world [30,40,44,50,75,76]. These became necessary because of the observed variation in the display of sexual dimorphism between different population groups [26]. Consequently, the application of standards for population groups is not encouraged for other groups. One disadvantage of using population-specific standards is having prior knowledge of the population group of the skeleton under forensic analysis [63,64].
In the present study, patella measurements of South Africans were shown to be sexually dimorphic and which is consistent with the results of previous studies on patella of Italians [77], Americans [78], Iranians [79], Spaniards [80], African Americans [81], Japanese [82], Turks [83], and Swiss [84]. The range of average accuracies obtained for pooled multivariate discriminant function equations (DFEs) and stacking ML technique in the current study (81.9-90.8%, Tables 4 and 6) compares well with those presented for previous studies in other parts of the world (Table 7). It is interesting to note that the highest average accuracies for all studies that utilized skeletal collection in the acquisition of data are approximately not more than 85% [46,66,80,81,84]. Other studies in which data were collected from radiological modalities and autopsy acquired data presented higher average accuracies in correct sex classification. This is an indication that the source of data and how these data are collected may influence the outcome of the results of the analysis.
In addition, the average accuracies for the pooled data for the patella from the current study using discriminant function analysis (81.9-84.2%) are similar to those observed for SAED (75-85%: [46]) and SAAD (78-85%: [66]). The highest drop in average accuracies in the current study (0.8%) is lower than those from population-specific DFEs for SAED and SAAD which were 2.5% and 3.3%, respectively. This observation of a lower drop of average accuracies for DFEs obtained from pooled data compared to population-specific DFEs agrees with Bidmos and Mazengenya [85] in which the highest drop in average accuracies for pooled data DFEs was 0.9%. This observation indicates a better validity of pooled DFEs compared to population-specific DFEs. Another previously documented advantage of the application of DFEs from pooled data is that they can be applied to an unknown skeleton without the prior knowledge of the population group [64].
The same performance trend is observed in the current study using the ML algorithm compared to the conventional statistical model. The standards generated for sex classification produced higher average accuracies (Table 4) compared to those generated using discriminant function analysis (Table 3). Compared to the average accuracies for the pooled data for the patella from the current study using discriminant function analysis (81.9-84.2%), the stacking machine learning approach provides an overall accuracy of 90.77%. This clearly indicates that with the application of the machine learning paradigm a better classification of sex from the patella measurement is possible.
From the aforementioned, linear and volumetric measurements of the patella are useful in human identification and have produced acceptably high average accuracies in correct sex classification. However, human identification from skeletal remains can be demanding especially in a country like South Africa with diverse population groups. Consequently, the application of population-specific DFEs in the human identification process will require the prior assignment of the population group which might be difficult if not impossible in cases where complete skeletons are not available or in the  absence of bones that display obvious population-specific traits. Another confounding problem is the difficulty in the assignment of population groups to individuals who fall within the boundaries of other population groups [52]. This has led some researchers [63,64] to propose the idea of a generation of generic standards for the estimation of sex and stature, especially for population groups that have similarities. In both studies, the authors argue for the generation and use of generic equations for sex assignment [63] and stature estimation [64] citing the lack of adequate data and bone collections from which data could be collected for the derivation of population-specific standards in some countries. The pelvic bone is considered one of the most sexually dimorphic bones in the body based on its design for parturition in females. Measurements of this bone have been used in the generation of population-specific DFEs in different parts of the world [2]. Steyn and Patriquin [63] assessed the reliability of population-specific DFEs compared to those from pooled data from diverse population groups. They reported a comparable performance of population-specific and generic DFEs with regard to classification rates and concluded that population-specific equations are not superior to generic equations with regard to sex prediction using dimensions of the pelvic bone. Macaluso Jr [86] evaluated the reliability of generic equations that were published by Steyn and Patriquin [63] on a French sample and reported that the average accuracies of the pooled data remained unchanged when applied to a French sample. In addition, there was no significant difference between the average accuracies obtained from the use of population-specific equations and generic equations [86]. This observation provided further proof of the usefulness and applicability of generic equations for sex prediction using pelvic measurements to other related population groups, where the application of ML can significantly help.
Attempts have also been made to apply the notion of nonsuperiority of population-specific equations over generic equations using measurements of the vertebrae. Hora and Sládek [87] observed that anteroposterior and mediolateral body diameters were found to be universally applicable in sex prediction while other measurements of the studied vertebrae showed population specificity in the assignment of sex. In a similar study, Bidmos and Mazengenya [85] investigated the utility of pooled data in the generation of generic equations for sex prediction. They evaluated the accuracies of population-specific equations formulated from measurements of long upper limb bones of South Africans and noted that the average accuracies of generic equations are acceptably high (81 to 87%). In addition, the cross-validated accuracies remained largely unchanged thereby confirming the usefulness of these equations in cases where it becomes difficult to establish the population affinity of the skeletal remain under forensic investigation.
Recently, Indra et al. [84] assessed the validity of population-specific DFEs formulated for patella measurements of the contemporary Spanish population group on a Swiss sample. The average accuracies obtained by Indra et al. [84] ranged from 63 to 84% for patella which was similar to those presented in an earlier study by Peckmann et al. [80]. The results of the current study in which the average accuracies obtained for generic equations are comparable to those presented for population-specific equations for South Africans of European [46] and African descent [66] in agreement with the observation made in previous studies [63,[84][85][86][87]. This, therefore, shows the utility of generic equations when the patella is available for forensic analysis in South Africa.
The range of average accuracies for generic equations formulated in the current study (81.9-90.8%) is similar to those obtained for population-specific equations derived for South Africans of European (67.5-85%) and African descent (78.3-85%). This is in agreement with the observation that was made by Indra et al. [84].

Conclusions
Prediction of sex from recovered or discovered bones in human identification is a very important step in forensic anthropologists along with the estimation of age, stature, and population affinity. In this study, we have used a dataset of 100 people collected from a sample of patella of Mixed Ancestry South Africans (MASA). Six parameters maxh, maxw, maxt, haf, lafb, and mafb were used. Two types of investigation have been carried out in this study to compare the performance of conventional statistical analysis versus the classical machine learning techniques in the estimation of sex. Different discriminant function analyses were performed for measurements that exhibited significant differences between male and female mean measurements. On the other hand, several ML algorithms were trained, validated, and tested to identify the best feature combination for detecting the sex from the patella measurements. The range of average accuracies obtained for pooled multivariate DFEs is 81.9-84.2% while the stacking ML technique provides 90.8% accuracy which compares well with those presented in previous studies. In conclusion, findings from the current study show that generic models formulated from measurements of the patella of different population groups in South Africa are useful resent with reasonably high average accuracies. Consequently, they are useful in the prediction of sex in cases when the population affinity is either difficult or impossible to ascertain and their applicability to populations of Southern Africa will require validation studies in individual populations from different countries in the region.