9.1 Introduction

The purpose of this study is to validate the CFS framework developed during the last two and a half years and agreed upon by the MEPROCS consortium. Thus, all partners will be asked to deal with a number of CFS cases (positives and negatives), variable according to their availability and time constraints, following all the recommendations collected within the framework concerning the procedure and materials, the set of landmarks and criteria, the gradual decision system, and the requirements of the technical equipment. Individual and average performance will be compared with the performance achieved in Chap. 7 to check for a possible improvement resulting from following the MEPROCS framework. Although a reliability study of real significance would require tackling a much larger number of cases, the current analysis will also serve to give an idea of the reliability of the methodology proposed. In addition, it can also be useful to study the influence of the technological means employed and of the knowledge and experience of the practitioner performing the superimpositions.

The dataset used in this study consists of eight identification scenarios (S1–S8). Each of the first four scenarios (S1–S4) implies comparing one skull against four candidates, each with a variable number of AM photographs. Each of the last four scenarios (S5–S8) involves comparing a single candidate, with a variable number of AM photographs, against four different skulls. These eight scenarios thus correspond to a total of 32 CFS cases, which in turn involve up to 72 SFO problems, as Table 9.1 details. The dataset was collected at two institutions, the Laboratorio di Antropologia e Odontologia Forense (Italy) and the University of Vilnius (Lithuania), after obtaining informed consent from the party responsible for the deceased, and was provided to the MEPROCS project following the data-sharing protocol established by the project's ethics committee.

Table 9.1 Summary of the materials forming the dataset for the study

The dataset provided for analysis consisted generically of a set of AM photos, photos of the skull (with scales), and a set of 3D models of the skull acquired with the Artec MHT structured-light scanner. Each set of case studies had the following structure: cases 1–4 mimic a scenario with one skull and three possible candidates, where only one ante-mortem photo of each candidate is available. Case 5 simulates a more complex scenario, including four skulls and four possible candidates, with only one ante-mortem photo available for each candidate. Cases 6 and 7 simulate a scenario with one skull and only one possible candidate, with several photos of the candidate available for analysis. Participants were not required to tackle all the cases or all the superimpositions within each case.

All the participants had to follow the MEPROCS CFS framework, that is, the best practices and recommended assessment criteria detailed in Chap. 7.

This framework includes:

  • The sources of uncertainty that have to be considered during the whole process

  • The sources of error that have to be minimized as much as possible

  • The best practices that must be followed

  • The practices that should be avoided

  • The most appropriate pairs of homologous landmarks for orientation and assessment

  • The requirements and desirable features of the technical means employed

  • The most important criteria that must be evaluated for assessing the anatomical skull-face relationship

  • The degrees for the craniofacial correspondence evaluation and the requirements to achieve each degree

All these recommendations had to be followed by all the participants in the current study. As a result of having a common methodology, "only" the technical means (which also have to fulfill certain requirements) and the knowledge and experience of the practitioner should make a significant difference in the process followed by the participants.

Some of the issues included in the best practices cannot be fulfilled due to the multicenter nature of the study. In particular:

  • Use the real skull to confirm correct fit of the mandible with the cranium.

  • Use the real skull and mandible to articulate the dentition and establish centric occlusion.

  • Locate and mark landmarks on the skull before scanning.

To minimize these shortcomings, we obtained a digital replica of every skull with a precise 3D scanner, which provided accurate and realistic 3D models that preserve the original geometry and texture of the physical skull (see Fig. 9.1).

Fig. 9.1

Skull 3D model without texture information (on the left) and with texture information (on the right)

The cranium and mandible are provided as separate 3D objects, so the participants can reproduce the position of the mandible as displayed in the AM photograph. In addition, they are also provided articulated as a single 3D object, in order to better display the mandible articulation and minimize the effect of the second point above. Landmarks were not located before scanning in order to avoid biasing the study, since this point is already an issue that creates differences among experts.

Fulfilling the rest of the points included in the best practices is not always possible. The participants are limited by the quantity and quality of the materials and thus, according to the decision degrees in CFS, the confidence in their decision has to be adapted. Tables 9.2, 9.3, 9.4, and 9.5 summarize all this information. For each identification case study, they detail the quantity and quality of the material of the corresponding CFS problems. Each SFO implies the comparison of a single AM photograph with a skull 3D model. Thus, the following columns give information about the quality of both the image and the skull: in particular, the view of the face of the subject within the photograph, whether the photograph is the original or has been modified (mainly cropped from a larger photograph), whether it is a digital or a scanned photograph, whether the quality (mainly resolution) of the image fulfills the MEPROCS guidelines, whether the teeth of the subject are visible, and whether there is any obscuring object or phenomenon complicating the identification. Some remarks are also given to indicate other difficulties; in this study in particular, the perspective of some of the photographs could be difficult to reproduce during the skull superimposition. Finally, information about the dentition of the skull is included, where a "NO" means that the skull does not have enough dentition for teeth comparison purposes.

Table 9.2 Summary of the materials forming the dataset employed for the study and ground-truth data
Table 9.3 Summary of the materials forming the dataset employed for the study and ground-truth data
Table 9.4 Summary of the materials forming the dataset employed for the study and ground-truth data
Table 9.5 Summary of the materials forming the dataset employed for the study and ground-truth data

The last two columns refer to the real skull-face relationship, that is, the ground-truth data (GT). The first of these two columns, "GT.I", indicates whether the result of the CFS problem should be positive (P) or negative (N). The last column, "GT.II", refers to a ground-truth value different from the binary positive/negative correspondence. It is a value established "manually" according to the scale defined in the decision degree table of the MEPROCS framework and the quality and quantity of the materials described in Table 9.1. However, there is one important issue that could not be considered while establishing this second ground truth. As the decision degree table asserts: "There could be discriminatory characteristics that allow going left or right within the scale given an appropriate explanation in the report." This is not possible to model a priori, because it belongs to the anatomical correspondence interpretation made by the participant. We have only modified "GT.II" in CFS cases 3 and 4 of identification case study 2, since the AM photographs clearly belong to Caucasian women while the cranium belongs to a Negroid subject.

Experience and familiarity with craniofacial identification techniques were also taken into account, and the level of experience of the participants was classified according to the following scheme:

  1. No previous experience and no CFS-related training

  2. No previous experience but CFS-related training

  3. Short previous research experience and CFS-related training

  4. Moderate previous experience with CFS real cases and CFS-related training

  5. Broad experience with CFS real cases

The study was carried out by 12 participants from the following institutions: University of Granada (Spain), Legal Medicine and Forensic Sciences Institute (Peru), Complutense University of Madrid (Spain), University of Melbourne (Australia), Azienda Ospedaliera-Universitaria di Trieste (Italy), Russian Academy of Sciences (Russia), National Research Institute of Police Science (Japan), University of Milan (Italy), and Portuguese Judiciary Police (Portugal). Table 9.6 lists all the participants (numbered from 1 to 12) with their corresponding level of experience. Since not all the participants completed the whole study, information on the number of CFS cases addressed is provided as well. Finally, the last column depicts an approximate degree (on a linguistic scale) of fulfillment of the guidelines suggested in the framework. Unlike the first study, no participant followed either a computer-aided manual video superimposition approach or a computer-aided manual photo superimposition approach. They all followed a computer-aided 3D-2D superimposition approach, manual in every case except participant 1, who employed semiautomatic software.

Table 9.6 Participants of the study, their experience related to CFS, and number of CFS cases addressed

After finishing every superimposition, participants were asked about the fulfillment of each point within the best practices and the practices that should be avoided, in order to obtain a degree of fulfillment of the framework. They were also asked whether the computer tools employed are in line with the requirements and desirable features established throughout the framework. Additionally, we double-checked the same points through the examination of the CFS image results provided by the participants. As a result, a few contradictions were found in some cases, while in others there was an absence of evidence supporting participants' answers. Thus, we finally arrived at Tables 9.7 and 9.8, which in most cases collect participants' answers ("Wp": whenever it was possible; "N": no; "Y": yes), although sometimes the values were changed when it was clear that the participant did not apply a particular point of the framework. Additionally, when no evidence was provided despite the participant's response, we indicate "Ne", that is, the participant answered "Yes" but there is no evidence that they fulfilled this point.

Table 9.7 Best practices fulfillment by the participants
Table 9.8 Fulfillment of the requirements and desirable features of the computer-aided tools employed by the participants

Concerning the practices that should be avoided, all the participants answered "Yes" to the first point (confirmation bias) and "whenever it was possible" to the second and third points (edentulous skulls and one single low-resolution image).

Apart from the 17 methodological points included in Table 9.7, the MEPROCS framework also defines some requirements (7) and desirable features (8) concerning the computerized tools employed for acquiring, visualizing, and superimposing the skull 3D models over the AM photographs. Table 9.8 depicts which of these characteristics are provided by the systems the participants employed and which are not.

Finally, participants were asked to fill in a table with the corresponding degree of consistency of the MEPROCS recommended criteria for each SFO addressed, to include at least one image showing the superimposition obtained, and to indicate the time employed (see the last column of Table 9.8). At the end, for each CFS they were asked to provide a numeric decision according to the defined degrees of decision (Damas et al. 2014). That numeric decision had to be supported by a decision report summarizing the main significant anatomical similarities and/or inconsistencies.

9.2 Results

A total of 244 CFS problems, involving 382 SFOs, were tackled by the participants. Their performance was measured according to true positive (TP), false positive (FP), true negative (TN), and false negative (FN) rates, together with expert proficiency (EP). All the indicators were calculated as the mean performance of each participant over the number of CFS cases addressed. Additionally, we also calculated a correlation value ρ₁ and a similarity value s.

Expert proficiency is calculated as \( \mathrm{EP}=\frac{\mathrm{TN}+\mathrm{TP}}{P+N}, \) where P and N are the number of positive and negative cases, respectively.

The equation for the correlation coefficient ρ₁ is: \( \frac{\sum_i\left({x}_i-\overline{x}\right)\left({y}_i-\overline{y}\right)}{\sqrt{\sum_i{\left({x}_i-\overline{x}\right)}^2{\sum}_i{\left({y}_i-\overline{y}\right)}^2}}. \)

The equation for the similarity coefficient s is: \( \left|1-\frac{\sum_i\left(\left|{x}_i\right|-\left|{y}_i\right|\right)}{N\times 3}\right|. \)

where X and Y are two sets of samples, \( \overline{x} \) and \( \overline{y} \) their respective means, and |x i| and |y i| the absolute values of a particular sample of sets X and Y, respectively. N is the total number of samples. In Tables 9.9, 9.10, and 9.11, the sample sets X and Y correspond to the "GT.II" values (Table 9.2) and the identification decisions made by a particular participant, respectively.
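
To make these indicators concrete, the minimal sketch below computes EP, the Pearson correlation, and the similarity coefficient s exactly as defined above. It assumes that decisions and "GT.II" values are coded on the MEPROCS numeric decision scale (here taken to span −3 to +3, consistent with the N × 3 normalization in s); the example data and variable names are purely illustrative, not taken from the study.

```python
import numpy as np

def expert_proficiency(tp, tn, p, n):
    """EP = (TN + TP) / (P + N): fraction of correct identification decisions."""
    return (tp + tn) / (p + n)

def pearson_rho(x, y):
    """Pearson correlation between two sample sets (rho_1 / rho_2 in the text)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

def similarity_s(x, y):
    """s = |1 - sum_i(|x_i| - |y_i|) / (N * 3)|: compares only degrees of support."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return abs(1.0 - np.sum(np.abs(x) - np.abs(y)) / (len(x) * 3.0))

# Illustrative data only: GT.II values vs. one participant's decisions,
# both coded on the decision scale assumed to range from -3 to +3.
gt2       = [3, 2, -3, -2, 1, -1]
decisions = [2, 2, -3, -1, 1, -2]
print(pearson_rho(gt2, decisions), similarity_s(gt2, decisions))
print(expert_proficiency(tp=2, tn=3, p=3, n=3))  # e.g., 2 TP and 3 TN out of 3 P + 3 N
```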

Table 9.9 Identification performance of all the participants according to the number of CFS cases they addressed
Table 9.10 Identification performance of different groups of participants according to the degree of fulfillment of the framework
Table 9.11 Identification performance of different groups of participants according to the participants’ CFS experience

While ρ₁ measures the correlation between the participants' decisions and the ground-truth value according to the MEPROCS scale ("GT.II"), s considers only the absolute values of the same variables. Thus, ρ₁ calculates a correlation value according to correct and incorrect decisions and how well the degree of support of such decisions correlates with the scale defined in the MEPROCS framework. In contrast, s is intended to isolate the measurement of the correlation in the degree of support (either positive or negative) of the decision.

Table 9.9 details participants’ performance together with the degree of fulfillment of the framework according to the following scale: High (“H”), Moderate (“M”), Low (“L”), and their possible combinations.

Considering only the global performance ("EP"), 5 of the 12 participants made correct identification decisions in more than 90% of their cases. In particular, P8 managed to achieve 100% correct decisions, although he/she only addressed 7 of the 32 CFS cases. In contrast, P1 with 96.87%, P4 with 93.75%, and P11 with 90.62% of correct decisions tackled all 32 CFS cases. The worst results were achieved by participant 10, with only 57.14% of correct decisions over the seven CFS cases faced. The average performance is 83.88%. With a similar tendency to the previous study (deliverable D4.4), "TN" and "FP" rates are clearly better than "TP" and "FN" rates. In fact, four participants (P1, P2, P4, and P8) perfectly addressed the negative cases, and the average "TN" rate is 90.27%. Additionally, looking at the correlation values (ρ₁), we can conclude that the decisions of the participants are well correlated with those expected within the scale given by the MEPROCS framework. Table 9.9 also includes the significance (p-values) for the null hypothesis "There is no linear correlation between the participants' decisions and GT.II" at different levels (*p < 0.05, **p < 0.01, and ***p < 0.001). Thus, according to the achieved p-values (superscript attached to each particular correlation value), all the participants' decisions except those of P12 are correlated with the expected decisions. Finally, higher values are obtained in the case of the similarity coefficient s. Such values increase because only the degree of support is considered.
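
As an illustration of how such significance levels can be obtained, the short sketch below uses SciPy's pearsonr, which returns both the correlation coefficient and the two-sided p-value for this null hypothesis, and then applies the star-coding convention used in Table 9.9. The data are hypothetical, not the study's decisions.

```python
from scipy.stats import pearsonr

# Hypothetical GT.II values and one participant's decisions (not study data).
gt2       = [3, 2, -3, -2, 1, -1, 3, -3]
decisions = [2, 3, -2, -2, 1, -1, 3, -3]

rho, p_value = pearsonr(gt2, decisions)  # Pearson r and two-sided p-value
stars = "***" if p_value < 0.001 else "**" if p_value < 0.01 else "*" if p_value < 0.05 else ""
print(f"rho_1 = {rho:.3f}{stars} (p = {p_value:.4f})")
```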

The same average values are depicted in the first column of Table 9.10. However, we also include different average rates corresponding to different groups of participants with an increasing degree of fulfillment of the framework (see Table 9.10). In this sense, the second column refers to those participants who obtained at least a medium degree, the third includes only those participants who obtained at least a medium-high degree, and finally, the last column includes participants with a high-medium and high degree of fulfillment of the framework. From these figures, we can appreciate a significant improvement in participants' performance in line with the degree of fulfillment. Expert proficiency varies from an average value of 83.88% to 95.31% for the last group. Similarly, "TN" rates go from 90.27% to an outstanding 100.00%. Although correlation follows the same increasing tendency, values slightly increase within the first groups and decrease in the last one.

Previous experience in CFS does not appear to be a relevant factor in the identification performance (see Table 9.11). Only slight differences are found when excluding participants who reported having neither previous experience nor CFS-related training.

Finally, Table 9.12 compares the performance of those participants who completed the two experimental studies designed by MEPROCS with the aim of testing CFS methodologies, tools, and practitioners' experience. While in the first one the participants followed their own methodology, in the current study they were required to stick to the MEPROCS framework, which also established some requirements and desirable features for the tools employed. The first important issue is the number of participants who performed both studies, nine in total, addressing 520 CFS cases in the first study and 222 in the second. The average performance values are better in the second study for the percentage of correct decisions (84.09% against 79.75%) and the "TN" (90.54% against 83.33%) and "FP" (9.54% against 16.66%) rates. "TP" (61.35% against 62.77%) and "FN" (38.64% against 37.22%) rates are slightly worse in the second study. However, such percentages are very similar in both studies.

Table 9.12 Participants’ performance comparison according to the results of the current study (MEPROCS methodology) and a previous one where the practitioners followed their own methodology

Finally, we computed some statistics using the values given by the participants to each particular criterion in each particular SFO case. First, the average standard deviation is used to numerically examine the differences among participants when assessing (with values from 0 to 5) the same criterion across all the SFO cases. These values range from 0.87 (criterion 3.19 in frontal view, dental information) to 1.61 (criterion 1.7 in frontal view, entocanthion vertical line right).

Differently from ρ₁, we have calculated the correlation between the SFO binary (either positive or negative) ground truth ("GT.I") and the values given by the participants to each particular criterion in each particular SFO case. The equation for this correlation coefficient ρ₂ is the same as for ρ₁, but the X sample now refers to the binary ground truth of the SFO case, "GT.I" (Tables 9.2–9.5), and the Y sample refers to the values given to the different skull-face correspondence criteria by the different participants. Similar to Table 9.9, Table 9.13 also includes the significance (p-values). According to them, the majority of the values given to the criteria by the participants are correlated with the CFS binary ground-truth values ("GT.I"). Only criteria 1.70 and 2.22 (in frontal view), and 4.17 and 4.14 (in lateral view), resulted not correlated. The strongest correlation (p-values lower than 0.0001) is achieved by criteria 3.19, 4.20, 4.50, 4.10, 2.23, and 3.10 (in frontal view) and 3.19, 3.18, 2.40, 3.11, and 2.90 (in lateral view). Exactly the same statistical results were achieved using Spearman correlation ("Sp") and regression ("r"), thus reinforcing these conclusions.
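
A minimal sketch of how these per-criterion statistics could be reproduced is shown below, assuming the binary "GT.I" values are coded as 1 (positive) and 0 (negative) and the criterion scores range from 0 to 5. The arrays and variable names are hypothetical, used only to show the computation of the standard deviation, ρ₂, and the Spearman check.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data (not study data): binary GT.I per SFO (1 = positive, 0 = negative)
# and the 0-5 scores given by participants to one criterion in those same SFOs.
gt_binary = np.array([1, 1, 0, 0, 1, 0, 1, 0])
criterion = np.array([4, 5, 1, 2, 4, 1, 3, 2])

sigma = criterion.std(ddof=1)                      # spread of scores for this criterion
rho2, p_pearson = pearsonr(gt_binary, criterion)   # Pearson correlation with GT.I
sp, p_spearman = spearmanr(gt_binary, criterion)   # rank-based (Spearman) check

print(f"sigma = {sigma:.2f}, rho_2 = {rho2:.2f} (p = {p_pearson:.4f}), Sp = {sp:.2f}")
```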

Table 9.13 Average standard deviation of criteria values (σ) and Pearson correlation between criteria values and GT.I (ρ₂)

9.3 Discussion and Conclusions

The main goal of this experimental study is to validate the framework for CFS developed by the MEPROCS consortium. Thus, a multiparametric analysis is discussed in this section according to framework understanding and fulfillment, participants' performance, and the correlation between expected decisions and those given by the participants. While the latter two issues can be quantitatively validated, only a qualitative validation is possible for the former.

9.3.1 Framework Understanding and Fulfillment

The last column of Table 9.6 shows the degree of fulfillment of the framework by each participant. This degree was "manually" fixed following a quantitative approach with a certain degree of subjectivity. Degrees have been assigned according to the number and importance of practices and requirements not fulfilled. Even though all the participants were asked to fulfill every single point of the framework, the average degree of fulfillment is moderate. The main reasons for not following the recommendations can be summarized as follows:

  • Impossibility due to limitations on the design of the study

  • Impossibility due to limitations on the practitioner skills or tools

  • Framework misunderstanding or resistance to change practitioner’s normal procedure

As we explained, a few points could not be fulfilled because of the remote nature of the study and the impossibility of sharing the actual skull, in particular points 1, 2, and 4. Point 8, "During the growth and development stages use the most recent AM photos," was not applicable in the study, since there are no cases corresponding to juvenile individuals. Almost all the participants answered "whenever it was possible" for points 5, 6, and 7. These points refer to the use of multiple AM photos in different poses, AM photographs of good quality, and images without obscuring objects. Among them, as Table 9.5 shows, there are some cases where these conditions are not fulfilled; hence, most of the responses were "Wp". Still referring to the AM pictures, most of the participants followed points 9, 10, 11, and 12. However, some participants decided not to articulate the mandible according to the AM photograph (point 3). This task was identified as an important source of error (approaching CFS without a correct mandible position) but also a source of uncertainty (it is not possible to know exactly the precise articulation). The lack of proper skills or the absence of adequate tools could be the reasons for avoiding mandible articulation, which has been identified as a difficult task.

The main limitations arose with the computer software employed. Eight of the participants noted that their software does not allow locating or showing landmarks, and four participants acknowledged that the computer program employed does not provide a transparency mode. Also with regard to the requirements, it is not clear whether the participants who asserted they use software that properly projects the 3D skull actually employed this feature, or whether the feature is missing or rather limited. When analyzing the degree of fulfillment of the desirable features for CFS recommended in the MEPROCS framework, the situation is even worse. Six of the tools do not provide visualization capabilities for 3D models with texture information, eight of them do not allow a wipe mode, three do not provide tools for simultaneous interaction with the 3D skull and the AM photograph, and only two of them provide tools for marking contours and lines and for distance measurement.

Although all the participants reported fulfilling point 11, "…do not use cropped images…", four participants provided cropped images as a result. They later acknowledged cropping the image before performing the SFO. This could have a negative effect depending on the software employed and the perspective of the particular photograph.

A majority of the participants did not give evidence of fulfilling point 13, "analyze and describe separately both the skull and the face in the photograph(s) to be compared…". A clear example is identification scenario S2, where the skull belongs to a Negroid subject and three of the four possible candidates are clearly Caucasian. Just a few participants documented this issue and avoided performing the corresponding SFO.

Contrary to point 15, "use as many criteria as possible in order to study the relationship between the face and the skull," some of the participants only provided values for the framework criteria without further analysis of other criteria. It is not clear whether they mentally or methodologically employed other criteria, but there is no evidence to support that. Similarly, a few participants did not provide any evidence of following recommendations 16 and 17, "Consider the discriminative 'power' of each anatomical criterion" and "Give an appropriate 'weight' to each criterion…", respectively. Behind the absence of evidence for points 13, 15, 16, and 17 could be the nature of the task (an experimental study rather than a real identification scenario) and time constraints, which may make practitioners less precise and rigorous in their reports.

Similarity coefficient s values are in general high, which denotes a good understanding and fulfillment of the decision degrees table of the framework.

The main conclusion regarding framework understanding and fulfillment is that, in general, the best practices and software requirements were understood. However, it was difficult to follow all the recommendations, mainly due to design limitations and the lack of specific software.

9.3.2 Participants’ Performance

Results depicted in Tables 9.10 and 9.12 show a clear improvement in performance linked to the fulfillment of the framework. This is quite obvious when incrementally considering those participants with at least a certain degree of fulfillment of the framework (Table 9.10). We can assert that, for the given study, the relation between performance and framework fulfillment is almost linear. Even if the degree of fulfillment of the framework is not considered, there is still an objective improvement in performance. This fact is demonstrated in Table 9.12, where the results of the same group of participants in two similar studies were compared. In the first one, the participants followed their own methodology, while in the current study they were required to follow the MEPROCS framework. There is an improvement in the global performance and, in particular, in "TN" and "FP" rates. At this point, it is important to note one important difference between the types of identification scenarios presented in both studies. While in the current study there are at least two AM photographs in 27 of the 32 CFS problems, in the study subject for comparison (Ibáñez et al. 2014b) only 4 of a total of 60 CFS problems involve the comparison of more than one AM photograph.

9.3.3 Correlation

Together with the analysis presented in Tables 9.7 and 9.8, the correlation values obtained reinforce the assumption that the introduction of the framework improves participants' performance. Although not directly related to "EP", the ρ₁ correlation is in general higher for those participants performing better (see Table 9.10). In addition, the p-values demonstrate a strong correlation between the participants' decisions and the expected ones. Similarity coefficient s values are even higher than the correlation ones. This closer relation could be expected, since the s coefficient does not link participants' decisions with correct decisions but just with the expected degree of support.

One important factor that could not be considered when we established the ground-truth values of each CFS case ("GT.II" in Tables 9.1–9.5) refers to the inability to model in advance discriminatory characteristics of either the skull or the face in the AM photograph that allow, as indicated in the definition of the decision degrees table of the framework, going left or right within the scale given an appropriate explanation. This fact presumably has an influence on the correlation values, as the ground-truth values did not consider discriminatory characteristics but just the quality and quantity of the materials.

The assessment of the skull-face relationship criteria made by the participants turned out to be correlated with the CFS cases' ground truth ("GT.I") in the majority of the cases. However, criteria 1.70 and 2.22 (in frontal view), and 4.17 and 4.14 (in lateral view), are not correlated. We find two possible explanations for the lack of correlation of these criteria.

Although the proposed methodologies for achieving an optimal and unbiased SFO (Ibáñez et al. 2015) and for the analysis of the criteria for assessing skull-face correspondence in craniofacial superimposition (Ibáñez et al. 2014a) were themselves a great achievement, the conclusions drawn in the criteria study, as pointed out by the authors, were influenced by the materials employed (i.e., cone-beam CTs that lacked the upper part of the skull and presented additional inconveniences), and more thorough studies have to be developed. The low correlation values could also be explained by the different SFOs obtained by the participants. As pointed out in the first study (Ibáñez et al. 2014b), this of course had a clear influence on the analysis of the morphological relationship. A common problem that still affects the MEPROCS framework is the absence of an objective evaluation of the resulting SFO. Each participant followed a different approach, and it is not possible to say which one is better without ground-truth data.