Introduction

Over the last decade, artificial intelligence (AI) has been developing in almost all industries [1]. In particular, AI applications have emerged in expert fields such as medicine [2], justice, and criminal law [3]. In addition, AI is expected to develop further in emerging fields of medicine. For instance, in P5 (predictive, personalized, preventive, participatory, and psycho-cognitive) medicine, AI could support decision-making processes as well as diagnoses and prognoses [4].

Nowadays, AI may be considered a modeling tool for specific tasks [5]. For example, an AI model may be specifically designed to detect breast cancer from mammograms [6]. In this review, an AI application is considered to be a model, integrated into a computer program or part of one, that performs a specific task. Such a model can be built from data such as numerical or categorical variables, images, texts, or rules.

Therefore, one may expect to find AI applications developed for forensic medicine purposes in the literature. Indeed, Tournois and Lefèvre previously gave an overview of the AI applications used by forensic pathologists or physicians in daily practice [7]. In the present review, a systematic and reproducible method is provided to establish the state of the art on the daily use of AI by forensic pathologists or physicians. Since scoping reviews are better suited than systematic reviews for providing evidence to inform practice [8], a scoping review approach is adopted in this article. The objectives are to (i) identify the AI applications used by forensic pathologists or physicians and (ii) map the AI landscape in the expertise fields of forensic medicine by estimating the level of integration or maturity of the identified AI applications.

Methods

Protocol and registration

In this scoping review, the protocol was defined and adapted from the Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) [9]. Since scoping reviews are not eligible for registration on PROSPERO, no registration process was performed and no registration number was assigned to this review.

Information sources and search strategy

The articles were extracted from the PubMed, ScienceDirect, and Scopus databases from inception to September 28, 2022, using search queries described in Table 1.

Table 1 Search queries used for each database in order to extract articles for screening

Eligibility criteria

Articles were selected if they mentioned AI used by forensic pathologists or physicians in practice or if they explicitly described AI applications in one expertise field of forensic medicine. Those expertise fields include postmortem identification, postmortem interval estimation, the determination of the causes of death, and the clinical examination of living persons in a forensic context. However, articles were excluded if they mainly dealt with complementary analyses handled by experts who are not medical doctors, in the fields of forensic toxicology, entomology, dentistry, anthropology, psychology, epidemiology, biometrics, and ballistics. Articles were also excluded if they were published in a language other than English or French, or if they mainly dealt with the use of AI to analyze data for research purposes in forensic medicine. Only articles with an abstract available in English or French were retrieved.

Selection of sources of evidence

Articles were selected independently and blindly by two reviewers (LT, VT) on the basis of titles and abstracts, according to the eligibility criteria. A third reviewer (TL) arbitrated the articles on which the first two reviewers disagreed. The selected reviews were not included as reports; however, their references were included if they met the eligibility criteria.

Data charting process and data items

After article selection, the inclusion of articles in the scoping review was determined by one reviewer (LT) through full-text analysis. This analysis was performed with an analysis grid (see Table 2) derived from the TRIPOD checklist [10] and validated by the other two reviewers (TL, VT). It is worth noting that the final user of an AI application is rarely explicit in titles or abstracts. Therefore, articles describing AI applications for which the forensic pathologist or physician was not the final user were excluded from the review.

Table 2 Analysis grid used to extract relevant information from reports. aTRL: adapted technology readiness level

The level of maturity of the AI applications described in the selected articles was then assessed using an adapted Technology Readiness Level (aTRL) scale. The original TRL scale was defined with 7 levels, later extended to 9, corresponding to the maturity of a technology from the observation of its basic concepts to its successful use in practice [11]. However, this version of the TRL scale is not suitable for assessing the maturity of AI applications in forensic medicine. First, the levels described in the original scale were specifically designed for aerospace applications. Second, the number of levels in this scale is not compatible with the level of detail of the information extracted from the selected articles. Therefore, the levels of technology maturity had to be adapted for forensic medicine applications, and the original TRL scale was reduced to 3 values corresponding to the formulation of the AI application (aTRL = 1), the research and development stages of the AI model (aTRL = 2), and its use in daily practice (aTRL = 3).

Synthesis of results

The studies were grouped by expertise field of the forensic pathologist or physician, that is to say postmortem identification, the determination of the causes of death, the estimation of the postmortem interval, and clinical forensic medicine. For each expertise field, the number of articles and the highest aTRL were summarized to assess the level of development and integration of AI by expertise field.

Results

Selection of the sources of evidence

The systematic search of the literature resulted in 436 records. After duplicate removal, 378 records were selected for screening. Based on titles and abstracts, 339 records were excluded, with 39 reports sought for retrieval and full-text eligibility assessment. Among those reports, 1 was excluded because the full text was not accessible without reader registration. The application of the eligibility criteria to the full text of the 38 remaining reports led to the exclusion of 8 reports: 5 reports for which the final user of the AI model was not the forensic pathologist or physician, 2 reports describing an AI model for research purposes only, and 1 report not describing any AI model. From this screening process, 30 articles were eligible for this review. It is worth mentioning that 5 reviews were identified during the screening process. However, the description of the AI applications in those reviews was not detailed enough to assess their performance and level of maturity as precisely as for applications described in primary sources. Therefore, those reviews were excluded. Nevertheless, the references cited within the reviews were analyzed and included if they met the eligibility criteria, which led to the inclusion of 5 additional reports. A total of 35 studies were thus included in this review (Fig. 1).

Fig. 1 Selection of sources of evidence. Adapted from Page et al. 2021 [12]

Characteristics of the sources of evidence

The characteristics of the sources of evidence are described in Table 3. For each included study, the type of article and the purpose of the AI application are summarized. The 35 studies show that AI applications are developed in thanatology, especially for postmortem identification [13,14,15,16,17,18,19], the estimation of the postmortem interval [20,21,22], and the determination of the causes of death [23,24,25,26,27,28,29,30,31]. In clinical forensic medicine, AI models are mainly designed for age estimation [15, 32,33,34,35,36,37,38,39,40,41,42,43,44] and gender determination [15,16,17, 45]. One AI model is aimed at assessing and managing the risk of violent reoffending among prisoners [46] and one at dating bruises [47].

Table 3 Characteristics of sources of evidence. CT, computed tomography; MRI, magnetic resonance imaging; PMCT, postmortem computed tomography

Results of the individual sources of evidence

The results of the individual sources of evidence are summarized in Table 4. For each included study, the aim, type, performance, and maturity level of the AI application are summarized. Detailed results are available in Online Resource 1. The results show that all the included studies remain at the research and development stage (aTRL = 2). Moreover, the performance of the AI applications may be too low for the developed models to be used in daily practice. Indeed, if a low-performance model is defined as a model with performance metrics lower than 90% for classification tasks or an error greater than 1 year for age estimation, then 22 AI applications [14, 16, 18,19,20,21, 23,24,25,26,27,28, 33, 36,37,38,39,40,41,42,43, 46] would not be performant enough for medicolegal usage. Therefore, AI models do not appear to be used in daily practice by forensic pathologists and physicians.

Table 4 AI performance and integration level for each AI application included in this review. AUC, area under the ROC curve; LR, likelihood ratio; MAE, mean absolute error; RMSE, root mean square error; R2, correlation coefficient

Synthesis of results

In summary, 35 AI applications intended for use by forensic pathologists or physicians are identified in thanatology and clinical forensic medicine (Table 5); since some models apply to both fields, the counts below overlap. In thanatology, 19 AI models may help forensic pathologists identify deceased individuals, estimate the postmortem interval, or determine the causes of death. In clinical forensic medicine, 19 AI models may help forensic physicians estimate the age of young individuals, date bruises in physical assault contexts, and assess the risk of violent reoffending of prisoners. However, no AI application identified in this review seems to be currently used in daily medicolegal practice by forensic pathologists or physicians (aTRL = 2).

Table 5 Synthesis of results by expertise field. aTRL, adapted technology readiness level

Discussion

This review aimed at identifying the AI models used by forensic pathologists or physicians in their daily practice through a systematic search of the AI applications intended for medicolegal practice and described in the literature. This search resulted in the identification of 378 articles from reference databases and the inclusion of 35 studies published between 1999 and 2022. For each study, the level of integration or maturity of the AI application was assessed in order to map current medicolegal practices involving AI. The information extracted from the included reports showed that AI is developing in thanatology and clinical forensic medicine (see Table 5). In thanatology, AI models were designed for postmortem identification, the determination of the causes of death, and the estimation of the postmortem interval. In clinical forensic medicine, AI was used to estimate the age of living individuals, assess the risk of violent reoffending among prisoners, and date bruises. In [15, 16], and [17], an AI model was developed both for age estimation and gender determination. However, the final field of application of some AI models was not clear, that is to say that the expertise field in which the model is expected to be used was ambiguous. For instance, in [36] and [17], the AI model may be used for postmortem identification or for age estimation in clinical forensic settings. Therefore, in this review, when the final field of application was unclear and the model could be applied to several expertise fields, all of those fields were counted as application fields of the model.

It is worth mentioning that the included articles did not explicitly report any AI application currently used by forensic pathologists or physicians in daily practice. Therefore, the AI applications appear to still be at the research and development stage. Since AI in forensic medicine has only recently seen a renewal of interest, as suggested by the publication dates of the articles, it may simply be too soon to observe AI applications in medicolegal routine. This result may also be due to low model performance or to common AI-related issues.

Model performance is summarized for each AI application in Table 4. Currently, there is no well-defined threshold above which model performance is considered high enough to use a model in production. Moreover, such a threshold should differ depending on the AI application. However, a model that performs worse than non-AI methods described in the literature or than gold standards may be considered a low-performance model. The comparison of model performance with non-AI methods by expertise field is given in Table 6. No numerical comparison of performance with gold standard methods was made in 27 reports, a similar or lower performance was found in 3 reports, and the models outperformed non-AI methods in 5 reports. The performance of the models and their comparison to non-AI methods are quantified in Table 7. Articles that did not provide a quantified comparison between the performance of the AI model and that of non-AI methods often compared their performance with previous studies in which other AI models were developed. In order to assess the relevance of applying a model in medicolegal routine, a quantified comparison of performance between the AI model and the gold standard method should be provided. Ideally, the performance metrics should be computed on the same dataset, so that differences reflect the methods rather than the data.
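As a hypothetical illustration of such a same-dataset comparison, the sketch below computes the mean absolute error (MAE) of an AI model and of a non-AI method on identical cases. All values are invented for illustration and do not come from any included study.

```python
# Hypothetical sketch: comparing an AI model against a non-AI (gold standard)
# method on the SAME test cases, so that differences in MAE reflect the
# methods rather than the datasets. All numbers below are invented.

def mean_absolute_error(y_true, y_pred):
    """MAE in years for an age-estimation task."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# True chronological ages of the same test cases (years)
true_ages = [16.0, 17.5, 19.0, 21.0, 15.5]

# Predictions from the AI model and from a conventional (non-AI) method,
# both evaluated on the identical cases
ai_preds = [16.4, 17.1, 19.8, 20.6, 16.0]
gold_preds = [17.2, 16.3, 20.5, 22.4, 14.2]

mae_ai = mean_absolute_error(true_ages, ai_preds)
mae_gold = mean_absolute_error(true_ages, gold_preds)

print(f"AI model MAE: {mae_ai:.2f} years")
print(f"Non-AI MAE:   {mae_gold:.2f} years")
```

A report following this pattern would let readers judge directly whether the AI model outperforms current practice on the same material.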

Table 6 Comparison of model performance with non-AI methods described in the literature for each included report
Table 7 Comparison of model performance between non-AI methods and AI models. AUC, area under the curve; MAE, mean absolute error

However, model performance should not be interpreted as is, since a model may show good performance on a given dataset while being biased towards the validation set. Thus, model performance must be assessed on a held-out test set in order to prevent such biases [48]. A test set was used in approximately 58% of the articles; however, no test set was used in 14 articles [13, 14, 17, 20, 24, 25, 32, 33, 35, 36, 38,39,40,41]. Therefore, the model performance reported in those articles should be interpreted with caution.
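The train/validation/test partition discussed here can be sketched as follows; the dataset, proportions, and seed are illustrative assumptions, not taken from any included study.

```python
# Minimal sketch of a train / validation / test split. The test set is held
# out and touched only once, for the final performance estimate [48].
import random

def split_dataset(cases, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle the cases, then carve off a test set (final evaluation only)
    and a validation set (model selection); the rest is for training."""
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

cases = list(range(100))  # stand-ins for real forensic cases
train, val, test = split_dataset(cases)
print(len(train), len(val), len(test))  # 70 15 15
```

The key property is that the three sets are disjoint, so performance on the test set is not inflated by choices tuned on the validation set.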

Moreover, despite good performance on the test set, a model may not be able to generalize to new data. In this case, model performance may be overestimated due to overfitting [49]. This issue, common when developing machine learning models, was explicitly handled in 11 reports by techniques based on model architecture [34, 41] and parameters [38, 40, 44, 45], input data [41], and validation steps [20, 43, 46, 47]. In [34] and [41], the model architecture was modified to reduce overfitting by dropout regularization, that is to say randomly removing nodes from the model with a given probability in order to simplify it. Moreover, the authors in [41] added batch normalization layers to the model architecture, a technique known to reduce the generalization error of a model [50]. AI models were also developed by transfer learning, that is to say the use of a pre-trained model which is then adapted for a specific task, such as age estimation [38, 44] or gender determination [45] of living individuals. In [40], part of the model parameters were frozen during the training phase to prevent overfitting; this parameter fixation may only concern the weights of batch normalization layers [31]. In [41], the authors also used data augmentation, that is to say an artificial increase of the training data through transformations such as image rotations and translations. Indeed, increasing the amount of training data helps reduce overfitting in computer vision tasks [51]. To monitor the effect of overfitting, the performance of the model on the validation set was computed at given steps [20], all along the training phase [43], or at its end [43, 47]. In [43], the authors did not compute the model performance only once, as in [47], but 10 times by using tenfold cross-validation. This technique involves splitting a dataset into a training set and a validation set 10 times, with different instances in the validation set for each fold. This gives rise to 10 pairs of training and validation sets. Each pair is then used to train a model on the training set and assess its performance on the validation set, independently of the other pairs. Thus, cross-validation makes it possible to monitor overfitting by comparing performance on the training set and on the validation set. Moreover, this technique provides a mean and a standard deviation of performance over the 10 validation sets, which leads to a better assessment of model performance than the use of a single validation set. To sum up, several techniques may be used to reduce overfitting when developing a model. However, the number of articles that explicitly defined how overfitting was handled is clearly insufficient, which raises the question of whether the models described in the other articles are able to generalize to new data. Therefore, model overfitting must be studied before any use of an AI model in medicolegal routine.
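The tenfold cross-validation procedure described above can be sketched as follows; a trivial mean predictor stands in for a real AI model and the data are synthetic, so only the cross-validation mechanics are illustrated.

```python
# Sketch of tenfold cross-validation: 10 train/validation pairs, one
# performance measurement per fold, then a mean and standard deviation.
import statistics

def kfold_indices(n, k=10):
    """Split indices 0..n-1 into k contiguous folds (n assumed divisible
    by k for simplicity)."""
    fold_size = n // k
    return [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]

def cross_validate(values, k=10):
    """For each fold: 'train' on the other k-1 folds (here: compute their
    mean as a stand-in model), then measure MAE on the held-out fold."""
    folds = kfold_indices(len(values), k)
    maes = []
    for fold in folds:
        held_out = [values[i] for i in fold]
        train = [values[i] for i in range(len(values)) if i not in fold]
        model = sum(train) / len(train)          # "training": fit the mean
        mae = sum(abs(v - model) for v in held_out) / len(held_out)
        maes.append(mae)
    return maes

values = [float(i % 7) for i in range(100)]      # synthetic target values
maes = cross_validate(values)
print(f"CV MAE: {statistics.mean(maes):.3f} ± {statistics.stdev(maes):.3f}")
```

Reporting the mean and spread over the 10 folds, as in [43], is what gives a more stable performance estimate than a single validation split.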

The datasets used to develop a model should also be taken into account when assessing its performance, since the model is trained on input data with specific characteristics. In this review, all the studies clearly defined input characteristics or eligibility criteria, except [23] and [28], which did not report any restriction on the study population in terms of excluded cases (see Online Resource 2). Thus, the data used to develop a model may not be representative enough for a given case. For instance, the authors in [19] took photographs of volunteers' bruises made by projectiles fired from paintball guns in order to date the resulting injuries. All the volunteers were between 22 and 68 years old. The final model showed good performance metrics (> 96% for precision, sensitivity, and specificity). However, only one volunteer was above 40 years old, and all the injuries were located on an arm, a leg, the back, the chest, or a buttock. For those reasons, despite its good performance, the model may not be able to date bruises on the head, which is a typical target of injuries in cases of domestic violence [52], or for people aged above 40, since it was not well trained on those characteristics. Therefore, the restriction of input data characteristics may prevent the models reported in this review from being used daily by forensic pathologists or physicians, due to non-representative development datasets. Moreover, in classification models, output data may be imbalanced, that is to say that some data categories are over-represented compared to others, which often leads to good model performance for the over-represented categories at the cost of low performance for the others. In the included studies, output data appeared imbalanced towards age [16, 34, 35, 37, 40, 42,43,44,45] and feature classification [19]. Data imbalance may lead to model biases towards the most represented classes [53]. Therefore, model performance assessed on highly imbalanced data should be interpreted carefully before use in production.
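The effect of class imbalance can be made concrete with a hypothetical binary classifier; the labels and predictions below are invented for illustration and show how a high overall accuracy can hide poor performance on an under-represented class.

```python
# Sketch: overall accuracy vs per-class accuracy on imbalanced labels.
from collections import Counter

# 90 cases of the majority class (0), 10 of the minority class (1)
y_true = [0] * 90 + [1] * 10
# A classifier that almost always predicts the majority class:
# every class-0 case is correct, but only 2 of the 10 class-1 cases.
y_pred = [0] * 90 + [1, 1] + [0] * 8

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

per_class = {}
for cls in Counter(y_true):
    idx = [i for i, t in enumerate(y_true) if t == cls]
    per_class[cls] = sum(y_true[i] == y_pred[i] for i in idx) / len(idx)

print(f"Overall accuracy: {overall:.0%}")        # 92%
print(f"Majority class:   {per_class[0]:.0%}")   # 100%
print(f"Minority class:   {per_class[1]:.0%}")   # 20%
```

This is why per-class metrics (or balanced accuracy) matter when the included studies report performance on imbalanced data.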

It is worth mentioning that neural networks, a type of AI model that currently requires a high volume of data compared to other types of AI models [54], were developed in 26 reports [14,15,16,17, 19, 20, 23,24,25, 27,28,29,30,31, 34,35,36, 38,39,40,41,42,43,44,45, 47]. Dataset sizes appeared highly variable across the studies, as shown in Online Resource 2. The number of cases used to develop models ranged from 10 [33] to 5756 [21]. Nevertheless, the maximal number of cases identified from the reports may be higher. Indeed, the authors in [39] used more than 12,000 images to develop their models but did not specify whether several images came from the same individual. Thus, it was not possible to assess the true number of cases used to acquire those images. Therefore, even when a high number of instances was used to develop a model, the number of underlying cases should be taken into account, since a low number of cases may reflect a lack of data representativeness.

Several limitations of this review may be identified. First, only 3 bibliographic databases were explored (PubMed, Scopus, and ScienceDirect) to select articles of interest. One may wonder whether the content retrievable from those databases is representative of current knowledge; however, most if not all forensic journals are indexed in these databases. Second, AI applications are rapidly emerging in forensic medicine. Therefore, this field should be regularly monitored to report the state of the art on the usage of AI by forensic pathologists or physicians. Third, the final user of an AI model was not always obvious, and that user may be a specialist, such as an anthropologist for models designed from bone-related data or a psychologist for behavior-based algorithms. Moreover, the daily tasks of a forensic pathologist or physician may differ from one country to another, making it difficult to determine the final user of a model.

However, this review makes it possible to maintain a good overview of the use of AI applications in forensic medicine over time. First, it reports a state of the art of the AI applications used by forensic pathologists and forensic physicians. Second, it also reports the level of integration of each included AI model, which makes it possible to follow the evolution of AI applications from concept to use in medicolegal practice. Therefore, this review may later serve as a history of AI applications developed for forensic medicine purposes. Finally, the search queries (see Table 1) make it easy to extract the articles of interest and to update the review regularly in order to report future usages of AI in forensic medicine.

To sum up, the analysis grid given in Table 2 and derived from the TRIPOD checklist enabled us to analyze several aspects of predictive model development in the articles, such as input data characteristics or model performance. The features described in the TRIPOD checklist provide transparency regarding the model development process. Globally, all the articles described TRIPOD features, but only 5 articles seemed to follow the TRIPOD guidelines completely, without any lack of critical information. The other 30 articles lacked at least one feature, critical or not, from the TRIPOD guidelines, such as a clearly defined distribution of the datasets or the number of participants in the study, which prevents any complete assessment of model applicability. Therefore, if a model development process does not provide enough information or does not report explicitly and correctly every critical criterion given in Table 2, the resulting model cannot be directly transposed to routine medicolegal use.

It is worth mentioning that the model performances described in the articles are highly heterogeneous, with a majority of articles highlighting good model performance, which suggests that the models may perform well in daily practice. However, when diving deeper into the model development process, one may notice that the models may not be applicable to medicolegal practice due to several factors, such as real cases meeting one or several exclusion criteria of the sample or population of study. Moreover, the apparent lack of data to develop or validate AI models in the corpus of articles is a hurdle to the application of models in daily practice, since such a lack does not provide sufficient confidence or reliability to use those models. Furthermore, it may be difficult to understand how advanced AI models, such as dense neural networks, make decisions or predictions, so that they may be perceived as black-box models. The use of such models in routine may thus be unwanted by forensic pathologists or forensic physicians due to a lack of model explainability or understanding. For all those reasons, model performance alone is clearly not sufficient to assess model applicability. Besides, the raw performance of an algorithm, even when evaluated only in laboratory conditions and not yet confronted with the reality of daily practice, must always be examined from several angles. First, performance must be evaluated according to several complementary criteria or metrics, never just one: there is no single criterion that accounts for the performance of an algorithm. Then, this performance must be confronted with existing practice: does the algorithm do better than what we are currently doing? Finally, this “better,” when it exists, must be studied according to several components. The first is similar to what is called “clinical significance” in classic medical trials, for example for the evaluation of a drug or an intervention. Statistical significance alone weighs little against the need for a genuinely and sufficiently increased utility to justify a change of tool or practice. Indeed, if a new algorithm displays a performance of 82% compared to a well-proven, reliable, and installed practice with a performance of 81%, switching from one to the other is not obvious and not necessarily justified. Other important aspects must also be taken into account, such as the modifications either required (e.g., new equipment, software change, new data collection, data regulation) or induced by the adoption of this algorithm in current practice.

To this criterion of clinical significance must be added the ethical dimension of the use of algorithms. On the one hand, we must keep in mind that these algorithms are developed from data, and that their quality cannot exceed that of the input data. Worse, the use of a biased algorithm tends to reproduce and then reinforce those biases. Biases of gender, age, ethnic origin, and socio-economic level already present in the majority of classic clinical and epidemiological studies are now incorporated into the algorithms. On the other hand, the use of an algorithm does not exempt us from keeping in mind, depending on the field of application, what a rate of false positives or false negatives represents: however efficient an algorithm may be, are we ethically willing to take a greater or lesser risk of falsely concluding that an isolated minor is older than 18, or that a third party was involved in a criminal act?

The impact of the adoption of AI by forensic pathologists does not stop at their personal practice of medicine. Indeed, like the use of new techniques such as DNA or neuroimaging in criminology, the introduction of these new more or less autonomous, more or less normative, and biased decision-making tools is and will be examined by the other stakeholders, starting with lawyers and magistrates. The full adoption and acceptability of AI in forensic medicine are therefore also conditional on acceptance by these stakeholders.

Finally, we must not neglect the very practical side of the introduction of AI into daily practice. To use algorithms in a fluid, secure, and reliable way, they must be integrated into a work environment that allows it. In concrete terms, this means, for example, that there is already a suitable information system and quality data collection compatible with the use of AI, as well as practitioners trained along this entire data chain. However, we are generally still far from this situation, and a general convergence of tools and practices is therefore necessary. More broadly, it seems important to us that scientific societies, national and international, take up this subject of data and AI and formulate recommendations and good-practice guidelines concerning their use.

Conclusion

In forensic medicine, the AI applications meant to be used by forensic pathologists or physicians in daily practice are mainly intended for thanatology and clinical forensic medicine. The main expertise fields in which AI applications are developed are postmortem identification, the determination of the causes of death, the estimation of the postmortem interval, and the estimation of the age of living individuals. However, according to the literature, no AI application seems to be used daily by forensic medical doctors, since the AI models remain at the research and development stage. This may be explained by low or overestimated model performance, a lack of representative datasets, or the introduction of biases into AI models. Moreover, the implementation of AI in medicolegal practice does not only concern forensic pathologists or physicians but also magistrates and barristers, since medicolegal expertise is intended for justice institutions. Therefore, AI should be appropriated by forensic pathologists and physicians as well as legal professionals in order to be integrated into forensic medicine practices.