Background

In forensic investigations and medico-legal practices, the determination of a person’s identity is the first and one of the most important tasks. Forensic analyses are usually performed by examining different parts of the body, like face, teeth, skull, and fingerprints. Routinely, fingerprints and palm prints are used by forensic officials for identification (Uthman et al., 2012). The reason is that a substantial amount of research has been conducted targeting the fingerprints and palm prints for human identification. Like fingerprints and palm prints, footprints are also widely recovered as pieces of evidence from the crime scenes. It is evident from the reported cases that culprits mostly remove their footwear to reduce the noise of walking, hence leaving their foot impressions at the crime scene (Khan and Moorthy, 2013). Thus, the footprint analysis can provide a great help in the forensic investigations to identify the sex.

Genetics, lifestyle, and climatic factors define the morphology of the human foot. Age, body mass index, sex, and population group also affect the shape of the foot, and differences in feet between individuals can be attributed to these factors. Sex-related differences in foot morphology are especially important in footwear design and forensic anthropology. Furthermore, factors such as surface variables and plantar pressure distribution define the shape and depth of a footprint. Various studies on different populations across the world have reported that sex can be determined from footprint information with a reasonable degree of accuracy. The morphological characteristics used in these studies vary from population to population, which must be taken into account for a less erroneous estimation of sex in each population (Khan and Moorthy, 2013; Krauss et al., 2011; Abd-Elazeem and Yousef, 2013).

In 2017, a study was carried out in Uttar Pradesh, India, to evaluate the importance and reliability of footprint dimensions in sex, stature, and age estimation. A total of 400 samples were collected from individuals aged 10–65 years. The prints were taken bilaterally. A total of 7 measurements were used, of which 5 were toe lengths, recorded as T1–T5, and 2 were breadth dimensions, for the right and left footprints. The left footprint measurements were greater than those of the right for both males and females. It was concluded that there is a linear correlation between footprint length and stature for both males and females, whereas only a partial correlation was found between the footprints and sex, and likewise age (Singh and Yadav, 2017).

Another study, on the Haryanvi Jaat population of Haryana State, India, was conducted in 2016 to investigate sexual dimorphism using foot and footprint dimensions. A total of 400 samples (200 males and 200 females) were collected bilaterally from individuals aged between 21 and 25 years. Two measurements were taken: the length, from toe 1 to the exterior point of the heel of both feet, and the breadth of the footprint. The study concluded that males have a greater footprint length than females, whereas the inverse holds for the footprint index (Walia et al., 2016).

Atamturk (2010) presented a study to estimate sex from footprint measurements, e.g., heel breadth, ball breadth, and heel ball index, and reported that all measurements (length, breadth, and heel breadth) of the footprints were larger in males, but the sex differences were not statistically significant. The study concluded that footprint parameters can be used in the estimation of sex, but that the heel ball index may not be very useful in sex determination from footprints (Atamturk, 2010). An extension of this work was reported by Krishan et al. (2011).

In 2006, a study was conducted on footprints of 320 volunteers for sex estimation in Iban Ethnics, East Malaysia. Linear regression equations were employed on the foot length for discriminating males and females from the given population (Oberoi et al., 2006).

A similar study was conducted on the Ghanaian population in 2015 for sex identification, with a sample of 126 students (60 females and 66 males) aged between 18 and 30 years. The lengths of all toes, the breadth at the ball (BAB), the heel ball index (HB index), and the breadth at the heel (BAH) were used. All footprint dimensions showed significant results except the HB index. The accuracy of discriminating between the sexes was reported at 80% when toe 5 and BAH from the left footprints were used for analysis (Abledu et al., 2015).

Different parameters such as foot length and size, heel ball index, foot breadth measurements, i.e. breadth at ball and breadth at heel, have been used across the world by many researchers for various populations (Krishan et al., 2012, 2015). However, sex estimation through feet anthropometry is still an open problem as the scope is very wide due to diversity in the world’s population.

No studies have been reported on the determination of sex based upon footprint analysis in the Pakistani population. The aim of the present study is to assess the utility of footprints for sex classification. For this purpose, footprint samples of the population of Punjab, Pakistan, were collected from different geographical areas of the province. Footprints of 142 male and 138 female volunteers were collected. All toe lengths, ratios of toe lengths, the ball breadth index, and the heel breadth index of both feet of each individual were measured from these footprints and analyzed for sex determination.

Machine learning is a branch of computer science that gives a system the ability to learn from given data and make predictions on unseen data without explicit programming; it is also referred to as computational statistics (Rughani and Bhatt, 2017). Machine learning is helping forensic teams across the world in many ways: individual identification, forensic cyber security, computer forensics, and forensic criminology all use it to prevent and solve crime cases (Ariu et al., 2011; Nasrabadi, 2007).

Machine learning provides different types of computer algorithms to solve various real-world problems, such as Naïve Bayes, Random Forest, Random Tree, REP Tree, and J48 algorithm, particularly for classification problems. Sex identification is also a classification problem; therefore, these machine learning classification algorithms can help in prediction of sex using considered parameters, which can aid forensic analysts in sex identification cases (Brennan and Oliver, 2013; Kim et al., 2014; Nath, 2006; Wang et al., 2013).

In this paper, the Naïve Bayes, Random Forest, Random Tree, REP Tree, and J48 machine learning algorithms were used for sex classification. Naïve Bayes is a simple but very powerful classification algorithm based on Bayesian inference that provides reasonable accuracy (Cichosz, 2014). Random Forest, Random Tree, and REP Tree build iterations of trees and reduce errors by using mean squared error, bagging, and choosing the best tree, respectively, for the best classification (Aljawarneh et al., 2017). J48 is a widely used machine learning algorithm for classification with a much enhanced pruning technique and fault tolerance that reduces misclassification (Heude et al., 2005).

Materials and methods

Sample collection

The present study was conducted on unrelated adult volunteers randomly selected from Punjab, Pakistan. The volunteers included 142 males and 138 females aged from 18 to 50 years. None of the volunteers were related to each other, because the anthropometric features of the human body are genetically influenced and family relationships can lead to similarity in these features (Heude et al., 2005). The research procedure strictly followed the ethical research standards of the University of Management and Technology, Lahore, Pakistan.

Before sample collection, the purpose of the study was explained to all volunteers. All volunteers were active and healthy with no previous surgical history. To remove dirt, all volunteers were asked to clean their feet with soap and water. Quick-drying duplicating ink was spread uniformly on a 0.30 × 0.30-m plain glass plate of 0.008 × 0.008-m thickness (Rughani and Bhatt, 2017; Singh and Yadav, 2017). The participants were asked to press each foot in turn on the glass plate with normal force, place it on a plain A4-size sheet of white paper, and then lift the foot without disturbing the paper (Robbins, 1986). All papers with vague footprints were excluded from the study, and the participant was requested to repeat the process for a clean print. By this method, static footprints of both feet, i.e., left and right, were collected from all participants. In this paper, different measurements were used for the identification of sex. The identification can also be carried out if the crime scene investigator does not have all of the measurements, as in the case of dynamic footprints, because not all static features are usually present in a dynamic footprint (Reel et al., 2012).

Method flow chart

The first step of this study was the collection of footprint data from 280 adult volunteers, including 138 females and 142 males, from different areas of Punjab. The second step was taking the measurements of selected features, i.e., toe lengths, ball breadth, and heel breadth of the feet. A straight line was drawn from the outermost point of the heel to the lateral point of the first toe, as shown in Fig. 1; the length of each toe was taken as the distance from the outermost point of the heel to the tip of that toe. Ball breadth was measured as the distance between the most medial and the most lateral points of the footprint at the ball, and heel breadth was measured as the broadest distance across the heel. Table 1 lists the measurements taken from each footprint and defines the standard followed in this study for measuring the lengths of the 5 toes, the ball breadth, and the heel breadth. The ball breadth index and heel breadth index were calculated using the following formulas:

$$ \mathrm{BBI}\ (\mathrm{ball}\ \mathrm{breadth}\ \mathrm{index})=\left(\mathrm{ball}\ \mathrm{breadth}/\mathrm{length}\ \mathrm{of}\ \mathrm{toe}\ 1\right)\times 100 $$
$$ \mathrm{HBI}\ (\mathrm{heel}\ \mathrm{breadth}\ \mathrm{index})=\left(\mathrm{heel}\ \mathrm{breadth}/\mathrm{length}\ \mathrm{of}\ \mathrm{toe}\ 1\right)\times 100 $$
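The two index formulas can be illustrated with a minimal Python sketch. It assumes that the denominator is the heel-to-tip length of toe 1; the function name `breadth_index` and the measurement values are hypothetical and are not taken from the study data.

```python
def breadth_index(breadth_cm: float, toe1_length_cm: float) -> float:
    """Return (breadth / toe 1 length) * 100, the form shared by BBI and HBI."""
    if toe1_length_cm <= 0:
        raise ValueError("toe 1 length must be positive")
    return breadth_cm / toe1_length_cm * 100.0


# Hypothetical measurements in centimetres (illustrative only).
toe1_length = 24.0
ball_breadth = 9.0
heel_breadth = 6.0

bbi = breadth_index(ball_breadth, toe1_length)  # ball breadth index
hbi = breadth_index(heel_breadth, toe1_length)  # heel breadth index
print(f"BBI = {bbi:.1f}, HBI = {hbi:.1f}")
```

Because both indexes are ratios scaled by 100, they are unit-free as long as breadth and length are recorded in the same unit.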
Fig. 1 Foot measurements

Fig. 2 Methodology proposed by the present research

Table 1 Parameters and their measurements

Furthermore, the ratios between all toes were calculated as (toe 1 to toe 2, toe 1 to toe 3, toe 1 to toe 4, toe 1 to toe 5, toe 2 to toe 3, toe 2 to toe 4, toe 2 to toe 5, toe 3 to toe 4, toe 3 to toe 5, and toe 4 to toe 5).
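The ten pairwise ratios listed above can be generated programmatically. The following Python sketch is illustrative only: the helper `toe_ratios`, the key format `T1/T2`, and the toe lengths are assumptions, not part of the study.

```python
from itertools import combinations


def toe_ratios(toe_lengths):
    """Map 'T1/T2', ..., 'T4/T5' to the ratio of the two toe lengths."""
    if len(toe_lengths) != 5:
        raise ValueError("expected five toe lengths (toe 1 to toe 5)")
    # combinations() yields index pairs in the same order as the list above:
    # (1,2), (1,3), (1,4), (1,5), (2,3), (2,4), (2,5), (3,4), (3,5), (4,5).
    return {
        f"T{i + 1}/T{j + 1}": toe_lengths[i] / toe_lengths[j]
        for i, j in combinations(range(5), 2)
    }


# Hypothetical heel-to-toe-tip lengths in centimetres for toes 1 to 5.
lengths = [24.0, 23.5, 22.5, 21.0, 19.5]
ratios = toe_ratios(lengths)
print(len(ratios))
print(list(ratios))
```

Together with the five toe lengths, these ratios give 15 input features per foot for the toe-based classification experiments.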

The third step was the implementation of classification techniques for sex identification. Algorithms that were chosen for this purpose were Naïve Bayes, Random Forest, Random Tree, REP Tree, and J48 algorithm (Cichosz, 2014; Kim et al., 2014; Rughani and Bhatt, 2017). The measurements obtained in the second step were the input parameters to the abovementioned algorithms. In the fourth step, a final decision was made by the classification algorithms used (in step 3) for sex identification. The algorithm will use the features of feet and perform its analysis for differentiating males from females in the available sample (Fig. 2).

Statistical analysis

The statistical analysis was performed using Weka 3.8 for Windows. We employed Naïve Bayes, J48, Random Forest, Random Tree, and REP Tree for sex classification using BBI and HBI (Aljawarneh et al., 2017; Cichosz, 2014; Kalmegh, 2015). Naïve Bayes is a classifier based on Bayes' theorem; it requires only a small set of training data to identify the sex. The Random Forest classifier fits a number of decision tree classifiers on different samples of the dataset and averages their outputs in order to improve accuracy. Random Tree builds trees at random, without pruning, and performs its analysis on the basis of these trees. REP Tree is a fast decision tree learning algorithm. It considers all the attributes, constructs the decision tree using information gain (entropy) or variance, and then applies reduced-error pruning, which minimizes the error that arises due to variance. J48 is widely used; it creates a binary tree, and after the tree is built, it is applied to each record of the database to perform the classification.

Naïve Bayes algorithm for continuous data is based on the equation

$$ p\left(x=v\mid {C}_k\right)=\frac{1}{\sqrt{2\pi {\sigma}_k^2}}{e}^{-\frac{{\left(v-{\mu}_k\right)}^2}{2{\sigma}_k^2}} $$
(1)

Let x be the length of a given toe. The data are segmented into the male and female classes, and the mean and variance are calculated for each class. Let μk be the mean of x in class Ck, \( {\sigma}_k^2 \) the variance of x in class Ck, and v an observation of unknown class, i.e., either male or female. Then the class-conditional probability density p(x = v | Ck) can be computed for each class by substituting these values into Eq. 1.
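The following Python sketch shows how Eq. 1 can be used to classify a single toe length. It is a simplified, single-feature illustration and not the Weka implementation used in this study; the function names and the training values are hypothetical.

```python
import math


def gaussian_pdf(v, mean, variance):
    """Class-conditional density p(x = v | C_k) from Eq. 1."""
    exponent = -((v - mean) ** 2) / (2 * variance)
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)


def fit_class(values):
    """Return the mean and population variance of one class's measurements."""
    mu = sum(values) / len(values)
    var = sum((x - mu) ** 2 for x in values) / len(values)
    return mu, var


def classify(v, params, priors):
    """Pick the class with the largest prior * likelihood."""
    scores = {c: priors[c] * gaussian_pdf(v, *params[c]) for c in params}
    return max(scores, key=scores.get)


# Hypothetical toe 1 lengths in centimetres (illustrative only).
train = {
    "male": [25.1, 24.6, 25.8, 24.9],
    "female": [22.8, 23.4, 22.5, 23.1],
}
total = sum(len(vals) for vals in train.values())
params = {c: fit_class(vals) for c, vals in train.items()}
priors = {c: len(vals) / total for c, vals in train.items()}

print(classify(25.0, params, priors))
print(classify(22.9, params, priors))
```

With several features, Naïve Bayes multiplies the per-feature densities for each class, under the assumption that the features are conditionally independent given the class.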

The Random Forest algorithm averages the predictions of its B individual random trees f_b for an unseen sample x′:

$$ f=\frac{1}{B}{\sum}_{b=1}^B{f}_b\left({x}^{\prime}\right) $$
(2)
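The averaging in Eq. 2 can be sketched in Python as follows. The per-tree outputs are assumed here to be each tree's estimate of the probability of the class "male"; the values and the 0.5 decision threshold are hypothetical choices for illustration.

```python
def ensemble_average(tree_outputs):
    """Average the B per-tree outputs f_b(x') as in Eq. 2."""
    if not tree_outputs:
        raise ValueError("need at least one tree output")
    return sum(tree_outputs) / len(tree_outputs)


# Hypothetical per-tree estimates of P(male | x') from B = 5 trees.
per_tree = [1.0, 0.0, 1.0, 1.0, 0.0]
p_male = ensemble_average(per_tree)
label = "male" if p_male >= 0.5 else "female"
print(p_male, label)
```

Averaging over many trees trained on different samples reduces the variance of the prediction compared with any single tree.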

The J48 algorithm uses the entropy and information gain for making a decision for determination of sex.

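$$ H(T)=-{\sum}_{i=1}^j{p}_i{\log}_2{p}_i $$
(3)
$$ IG\left(T,a\right)=H(T)-H\left(T\mid a\right) $$
(4)

where p_i is the proportion of samples in T that belong to class i, j is the number of classes, and H(T | a) is the entropy of T after it has been split on attribute a.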
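A short Python sketch of Eqs. 3 and 4 is given below, using the standard Shannon entropy with a leading minus sign and a weighted average of the subset entropies for the conditional term. The labels and the split are hypothetical; in practice the split would come from a threshold on a footprint measurement.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(T) in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, left, right):
    """IG = H(T) - H(T | a) for a split of `labels` into `left` and `right`."""
    n = len(labels)
    conditional = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - conditional


# Hypothetical sample of 4 males (M) and 4 females (F).
labels = ["M"] * 4 + ["F"] * 4
# Hypothetical split produced by a threshold on one measurement.
left = ["M", "M", "M", "F"]
right = ["M", "F", "F", "F"]

print(entropy(labels))
print(information_gain(labels, left, right))
```

A split that separates the two classes perfectly yields the maximum gain of 1 bit for a balanced two-class sample, whereas an uninformative split yields a gain of 0.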

The mean and standard deviation of the data were computed using:

$$ \mu =\frac{\sum x}{N} $$
(5)
$$ \sigma =\sqrt{\frac{1}{N}{\sum}_{i=1}^N{\left({x}_i-\mu \right)}^2} $$
(6)

where N is the total number of samples (male and female), “x” is the sample, “σ” is the standard deviation, and “μ” is the mean of all samples.
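Eqs. 5 and 6 can be computed as in the following Python sketch, which cross-checks the result against the standard library. The sample values are hypothetical.

```python
import math
import statistics


def mean(samples):
    """Arithmetic mean, as in Eq. 5."""
    return sum(samples) / len(samples)


def population_std(samples):
    """Standard deviation with a 1/N divisor, as in Eq. 6."""
    mu = mean(samples)
    return math.sqrt(sum((x - mu) ** 2 for x in samples) / len(samples))


# Hypothetical toe 1 lengths in centimetres (illustrative only).
toe1_lengths = [24.0, 25.0, 23.0, 26.0, 22.0]
mu = mean(toe1_lengths)
sd = population_std(toe1_lengths)

# statistics.pstdev uses the same 1/N divisor as Eq. 6.
print(mu, sd, statistics.pstdev(toe1_lengths))
```

Note that Eq. 6 divides by N, so it corresponds to `statistics.pstdev`; `statistics.stdev` divides by N − 1 and gives a slightly larger value for the same data.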

Results

In this experiment, we identify the sex using different foot parameters (foot toe lengths, ratios of all foot toe lengths, ball breadth index, and heel breadth index). After performing the analysis, we attained the following results.

In the description of the results, the abbreviations LFT1 to LFT5 denote left foot toe 1 to toe 5, and RFT1 to RFT5 similarly denote right foot toe 1 to toe 5.

Comparison between the mean of right and left foot (both males and females)

In males, a greater mean value was observed for foot toe 1 and ball breadth index of the left foot, while a higher mean value was observed for the heel breadth index of the right foot in comparison to the left foot.

Similarly, in females, a greater mean value was observed for ball breadth index and foot toe 1 of the left foot, whereas a higher mean value of heel breadth index was observed of the right foot as compared to left (Table 2).

Table 2 Descriptive statistics: left foot dimensions and foot index among Punjabi population

Comparison between the standard deviation of the right and left foot (both males and females)

In males, the standard deviation was noticeably high for the foot toe 1 and heel breadth index of the right foot, while a higher standard deviation was noticed for the ball breadth index of the left foot as compared to right.

In females, the higher standard deviation was observed in foot toe 1 and ball breadth index of the right foot, whereas a higher standard deviation value was noticed for the heel breadth index of the left foot in comparison to the right (Tables 2 and 3).

Table 3 Descriptive statistics: right foot dimensions and foot index among Punjabi population

Determination of accuracy

In classification problems, accuracy is the percentage of data instances for which the algorithm identifies the correct class, whereas error is the percentage of misclassified instances (Ariu et al., 2011). In this paper, there were two classes, “male” and “female”; the precision of an algorithm was measured as the proportion of males correctly classified as male and of females correctly classified as female.
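These definitions can be made concrete with a small Python sketch. The function `evaluate` and the label lists are hypothetical; the per-class rate computed here is the proportion of each class that is correctly classified.

```python
def evaluate(actual, predicted, classes=("male", "female")):
    """Return overall accuracy, error rate, and per-class true-positive rate."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(actual, predicted))
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    tp_rate = {}
    for c in classes:
        members = [p for a, p in pairs if a == c]
        if not members:
            raise ValueError(f"no instances of class {c!r}")
        # Share of true members of class c that were predicted as c.
        tp_rate[c] = sum(p == c for p in members) / len(members)
    return accuracy, 1.0 - accuracy, tp_rate


# Hypothetical labels: one male is misclassified as female.
actual = ["male"] * 5 + ["female"] * 5
predicted = ["male"] * 4 + ["female"] + ["female"] * 5

accuracy, error, tp_rate = evaluate(actual, predicted)
print(accuracy, error, tp_rate)
```

In this toy example, nine of ten instances are correct, so the accuracy is 0.9, while the per-class rates differ (0.8 for males and 1.0 for females).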

Sex classification based on left foot parameters (toe 1 length and indexes)

Sex classification based on the left foot ridges was performed using different algorithms (Naïve Bayes, J48, Random Forest, Random Tree, and REP Tree). The input parameters to these algorithms include the LFT1 (left foot toe 1), BBI, and HBI of the left foot.

Using the J48 algorithm, the value of TPs (true positives/accuracy) was higher in males as compared to females. The accuracy rates were 0.873 and 0.841 for males and females, respectively, while the overall accuracy rate was observed at 85.7% (Table 4).
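The overall rate is consistent with a sample-size-weighted average of the two per-class rates. The sketch below assumes that the per-class rates are proportions of the 142 male and 138 female samples; the function name `overall_accuracy` is hypothetical.

```python
def overall_accuracy(rate_by_class, count_by_class):
    """Weight each per-class accuracy rate by the number of samples in that class."""
    total = sum(count_by_class.values())
    correct = sum(rate_by_class[c] * count_by_class[c] for c in rate_by_class)
    return correct / total


counts = {"male": 142, "female": 138}        # sample sizes reported in this study
j48_left = {"male": 0.873, "female": 0.841}  # J48 rates reported for the left foot

overall = overall_accuracy(j48_left, counts)
print(f"{overall * 100:.1f}%")
```

Under this assumption, the J48 rates for the left foot reproduce the reported overall accuracy of 85.7%.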

Table 4 Accuracy rates for sex determination through following algorithms using toe length and foot indexes

The Random Forest algorithm demonstrated an overall accuracy rate of 85%, with TP rates (accuracy) of 84.1% for females and 85.9% for males. With the Random Tree algorithm, the overall accuracy rate was observed at 83.9%, and the value of TPs was slightly greater for females than for males: the individual accuracy rates were 0.841 and 0.838 for females and males, respectively (Table 4).

The REP Tree algorithm exhibited an accuracy rate of 82.5%, whereas the TP values were 0.826 and 0.824 for females and males, respectively. Through the Naïve Bayes algorithm, the accuracy rate was 0.906 for females and 0.852 for males, whereas the overall accuracy rate of sex classification was observed at 87.8% (Table 4).

Hence, if FT1, BBI, and HBI of the left foot are used as parameters, the Naïve Bayes algorithm is the most suitable for sex determination, with 87.8% accuracy, whereas the REP Tree algorithm demonstrated the lowest accuracy rate (82.5%).

Sex classification based on left foot parameters (toe lengths and ratios)

Sex classification was performed using left foot parameters (LFT1+ LFT2+ LFT3+ LFT4+ LFT5+ left foot toe ratios). Through Random Tree, Random Forest, and Naïve Bayes algorithm, the value of TPs was observed higher in females than that of males. In comparison, through J48 and REP Tree algorithm, the value of TPs was observed higher in males than that of females.

Through the Naïve Bayes algorithm, the accuracy rate was 0.91 for females and 0.83 for males while the overall accuracy was observed at 87.5% (Table 5). The accuracy rates, using Random Tree algorithm, were 0.833, 0.824 for females and males respectively while the collective accuracy rate was 82.8%. The accuracy rates, using the Random Forest algorithm, were 0.870 and 0.859 for females and males, respectively. However, the collective accuracy rate was 86.4%. The J48 algorithm showed an accuracy rate of 0.859 for males and 0.855 for females. However, the overall accuracy rate was at 85.7% (Table 5). The accuracy rates obtained by using REP Tree algorithm were 0.873 and 0.841 for males and females, respectively, while the collective accuracy rate was 85.7% (Table 5).

Table 5 Accuracy rates for sex determination through algorithms using toe lengths and toe length ratios

Thus, Naïve Bayes algorithm demonstrated the highest rate of accuracy at 87.5% for sex classification by using the selected parameters of the left foot, whereas Random Tree algorithm showed the lowest accuracy rate of 82.8% among all.

Sex classification based on right foot parameters (toe 1 length and indexes)

In this experiment, we identified sex through the J48, Random Tree, Random Forest, REP Tree, and Naïve Bayes algorithms using right foot parameters (FT1, HBI, and BBI). For all of these algorithms except REP Tree, the value of TPs (accuracy) was greater in females than in males; with REP Tree, males showed the greater TP value. The accuracy rate using the J48 algorithm was 0.88 and 0.80 for females and males, respectively, and the overall accuracy rate was observed at 84.2% (Table 6). The Random Tree algorithm revealed an accuracy of 0.812 for females and 0.810 for males, and their combined accuracy was observed at 81% (Table 6). The Naïve Bayes algorithm showed a marked difference between the TP values of the two sexes: the accuracy rate was 0.92 in females and 0.81 in males, and the overall accuracy rate of sex classification was observed at 86.4% (Table 6). The overall accuracy rate using the Random Forest algorithm was recorded at 84.2%, with an accuracy of 0.877 for females and 0.810 for males (Table 6).

Table 6 Accuracy rates for sex determination through following algorithms by using toe 1 lengths and indexes

Through the REP Tree algorithm, a greater TP value was observed for males than for females. The accuracy rates were 0.859 and 0.848 for males and females, respectively, and the collective accuracy rate was observed at 85.3% (Table 6). Thus, with the right foot parameters FT1, BBI, and HBI, the Naïve Bayes algorithm gives the best accuracy rate (86.4%) among all algorithms and is therefore the most suitable choice for these parameters, whereas the Random Tree algorithm gives the lowest accuracy rate (81%).

Sex classification based on right foot parameters (toe lengths and ratios)

Sex classification was performed using the right foot parameters (RFT1+ RFT2+ RFT3+ RFT4+ RFT5+ right foot toe ratios). Through J48, Random Forest, REP Tree, and Naïve Bayes algorithms, TP values observed were greater in females as compared to males.

By the J48 algorithm, the value of TPs was 0.819 and 0.817 for females and males, respectively, while the overall accuracy was 81.7%. The accuracy rates by using Random Forest algorithm were 0.870 and 0.845 for females and males, respectively, while the overall accuracy rate was observed at 85.7% (Table 7). By REP Tree algorithm, the value of TPs was greater for females in comparison to males. The accuracy rate was observed at 0.906 and 0.810 for females and males, respectively. However, the overall accuracy rate was observed at 85.7% (Table 7). The Naïve Bayes algorithm showed an accuracy rate of 0.964 for females and 0.68 for males, whereas the overall accuracy rate was observed at 82.1% (Table 7). By applying Random Tree algorithm, the value of TPs was greater in males than that of females. The accuracy rates were 0.859 and 0.797 for males and females, respectively, while the overall accuracy rate was observed at 82.8% (Table 7).

Table 7 Accuracy rates for sex determination through following algorithms by using toe lengths and toe length ratios

Thus, when the selected parameters of the right foot are used for sex classification, the Random Forest and REP Tree algorithms demonstrate the best accuracy rates (85.7%) among all, whereas J48 shows the lowest accuracy rate (81.7%) in comparison to all the other algorithms.

Overall result

After observing all the results, it can be concluded that the accuracy rates for identifying sex vary with the parameters used, as described in Tables 4, 5, 6, and 7. The Naïve Bayes algorithm showed the highest accuracy rates for all types of parameters except the ratios and toe lengths of the right foot (Table 7). It shows accuracy rates of 87.8, 87.5, 86.4, and 82.1% for the parameters used in Tables 4, 5, 6, and 7, respectively.

The Naïve Bayes algorithm performed best among all algorithms, with an overall accuracy rate of 86%. It can also be observed from these results that the Random Tree algorithm did not perform as well as the other algorithms applied in this study. It demonstrated accuracy rates of 83.9, 82.8, 81, and 82.8% in Tables 4, 5, 6, and 7, respectively, giving the lowest overall accuracy (82.6%).
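The overall figures can be checked with a short Python sketch. It assumes that each overall rate is the unweighted mean of the four accuracy rates reported for Tables 4 to 7; the dictionary layout is a hypothetical choice, and the values are those given in the Results.

```python
# Overall accuracy rates (%) reported in the text for Tables 4, 5, 6, and 7.
reported = {
    "Naive Bayes": [87.8, 87.5, 86.4, 82.1],
    "Random Forest": [85.0, 86.4, 84.2, 85.7],
    "REP Tree": [82.5, 85.7, 85.3, 85.7],
    "J48": [85.7, 85.7, 84.2, 81.7],
    "Random Tree": [83.9, 82.8, 81.0, 82.8],
}

# Unweighted mean over the four parameter sets (an assumption of this sketch).
averages = {name: sum(rates) / len(rates) for name, rates in reported.items()}

best = max(averages, key=averages.get)
worst = min(averages, key=averages.get)
for name, value in sorted(averages.items(), key=lambda item: -item[1]):
    print(name, value)
print("best:", best, "worst:", worst)
```

Under this assumption, the mean for Naïve Bayes is about 86% and that for Random Tree is about 82.6%, in agreement with the figures stated above.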

Discussion

Across the world, many researchers have taken the initiative to utilize footprints in sex determination. Different features of the foot have been used in various studies for accurate identification of sex. No such study has been conducted on any Pakistani population to date. The present study was conducted on the population of Punjab using different foot measurements, i.e., foot toe lengths, indexes, and toe ratios, for sex identification. All these parameters, i.e., foot toe lengths, indexes, and toe ratios, have greater values for the left foot than for the right foot. The maximum accuracy rate of sex determination was observed at 87.8% for the left foot and 86.4% for the right foot.

Previously, many reports have been published on foot-related studies and on the individuality of the foot and footprints, sexual dimorphism, height estimation, and association of foot and hand dimensions for the identification of a person in mass disasters (Kanchan et al., 2008; Krishan et al., 2011) (Table 8).

A study on the Ghanaian population has shown that by using the lengths of toe 1, toe 2, toe 3, toe 4, and toe 5, BAH (breadth at heel), and BAB (breadth at ball), a reasonable accuracy of 69.8–80.3% can be obtained for determination of sex (Abledu et al., 2015). In the present study, several parameters have been used, i.e., toe 1, toe 2, toe 3, toe 4, toe 5, ratios of the toes, ball breadth index, and heel breadth index. Different algorithms have been used for classification of sex, and the Naïve Bayes algorithm has the highest performance (87.8%) among them. Comparing the results of the present study with those of the study performed on the Ghanaian population, it can be seen that the methodology proposed in this study gives better results (Table 8).

Table 8 Comparison of the present study and other reported studies

The results of this study can also be compared with those described by Krishan et al. in an Indian population-based study (Krishan et al., 2012). Heel ball index and ball breadth index of both feet were used, and it was observed that the measurements at the heel and ball were significantly greater in males than in females. They concluded that heel ball index is not a significant parameter in sex determination (Krishan et al., 2011). In the present study, however, the mean value of ball breadth index is slightly greater in males, and the mean value of HBI is slightly higher in females (0.02 and 0.04 for left and right foot, respectively). Sex determination with these two parameters alone yielded only moderate accuracy (almost 70%). Therefore, to increase the accuracy rate of sex determination, the length of toe 1 was used along with the indexes. An accuracy rate of 87.8% was observed with the combination of these three parameters.

Table 8 provides a comparison of different studies that have been conducted previously for the discrimination of sex using footprints. It is evident from Table 8 that the method proposed in the current study outperforms those reported in the literature. Only the method proposed by Abd-Elazeem and Yousef (2013) has the same accuracy, i.e., 88%. However, they used only two parameters from footprint measurements (foot length and foot breadth), and their experimentation was conducted on a sample with a small age bracket, i.e., 20–35 years of age (Abd-Elazeem and Yousef, 2013), whereas our study is based on four footprint measurement parameters (lengths of all toes, ball breadth, heel breadth, and all foot toe ratios). In addition, the age window is much larger, i.e., from 18 to 50 years, which makes the results more robust and less biased. All other reported studies have lower accuracy as well as fewer parameters and smaller age windows. A bigger age window is helpful in forensic analysis, as crime perpetrators are generally in the age bracket of 18–50 years.

After comparing the proposed method with others available for sex determination, it can be said that if we use foot length, heel breadth index, and ball breadth index and perform sex classification using Naïve Bayes algorithm, then the sex identification could be achieved with more precision and accuracy. Therefore, it will be feasible to use these parameters for sex determination for the population of Pakistan.

Conclusion

This study demonstrates a reasonable degree of association between footprint measurements and sex, based on the footprint samples of 280 volunteers of both sexes aged between 18 and 50 years. Different classification algorithms, i.e., J48, Random Tree, Random Forest, REP Tree, and Naïve Bayes, have been used for the classification of sex. The results obtained from the Naïve Bayes algorithm were found to be more accurate in predicting sex than those of the other algorithms. The accuracy of establishing sex using ball breadth index, heel breadth index, and toe 1 length is 87.8% with the Naïve Bayes algorithm, which is high enough to be useful in practice; the results are better when left footprint parameters are used. It is possible that selected footprint measurements that show significant results for one population may not show significant results for another. Therefore, more research needs to be conducted on different populations to determine sex from the foot with a reliable accuracy rate. This type of study has been conducted for the first time on individuals of the population of Punjab, Pakistan. The important application of this study is the identification of an individual during forensic investigations based on footprint measurements. Further studies with large data samples from different geographical areas of Pakistan could improve the utility of this kind of research.