Introduction

Traumatic brain injury (TBI) is a major global health problem, with 950,000 cases per year [1], triggered mainly by falls and road injuries. With increasing population density, population aging, and the growing use of motor vehicles, motorcycles, and bicycles, the incidence of TBI is likely to continue to rise, and patients with TBI are prone to death and disability [2]. TBI accounts for one-third to one-half of all injury-related deaths and is the leading cause of disability in patients under 40 years of age [3]. TBI imposes a heavy burden on low- and middle-income countries owing to its high incidence, mortality, and disability, together with the associated socioeconomic losses and reduced life expectancy and quality of life [4]. Therapeutic decisions are often based on the expected prognosis; reliable prognostication is therefore essential for a neurosurgeon deciding whether to perform surgical treatment [5]. Computational tools for predicting the prognosis of TBI patients help physicians intensify or reduce interventions in patients with poor or good prognoses, respectively.

The scoring tools currently used in clinical practice to judge the prognosis of TBI patients include the Glasgow Outcome Scale and the Marshall CT classification [6, 7], and the choice of tool differs between wards: the SOFA (Sequential Organ Failure Assessment), APACHE II (Acute Physiology and Chronic Health Evaluation II), and GCS (Glasgow Coma Scale) scores, for instance, are often used in intensive care units (ICUs) to roughly assess prognosis [8]. However, the diversity of evaluation time points and usage scenarios often prevents these tools from achieving the expected predictive performance in the precise stratification of TBI patients. For example, the GCS score remains controversial relative to the Full Outline of UnResponsiveness (FOUR) score, even though the latter was developed to address the former's evaluation shortcomings [9, 10].

Moreover, although many types of scoring tools are available, neither a comprehensive, systematic exploration of prognosis-related predictive factors nor a head-to-head comparison has been reported. Whether these tools differ in predictive performance, and how to choose an appropriate clinical scoring tool, remain to be elucidated. Over the years, some scholars have applied machine learning (ML) methods to build predictive models for patients with TBI, including neural networks (NN), naive Bayes (NB), logistic regression (LR), decision trees (DT), and support vector machines (SVM) [11, 12]. ML prediction models differ from the earlier scoring tools in their predictive factors. However, there is a lack of evidence that the predictive factors in ML models outperform those of traditional scoring tools, and whether ML prediction models offer better predictive performance than clinical scoring tools remains controversial. The latest guidelines for severe brain injury management do not answer this question. We therefore conducted this systematic review and meta-analysis to explore the predictive value of ML versus traditional scoring tools in the mortality risk stratification of TBI patients.

Methodology

This systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy studies (PRISMA-DTA) statement. The study protocol was registered in the International Prospective Register of Systematic Reviews (PROSPERO) and approved prior to the start of the study (ID: CRD42022377756).

Inclusion and exclusion criteria

Inclusion criteria

  1. The study subjects were patients diagnosed with TBI;
  2. The study type was a case-control study, cohort study, nested case-control study, or case-cohort study;
  3. A complete prediction model for mortality risk was constructed;
  4. Studies without external validation were also included;
  5. Studies applying different ML algorithms to the same dataset were all included;
  6. The study was reported in English.

Exclusion criteria

  1. Studies of the following types: meta-analyses, reviews, guidelines, expert opinions, etc.;
  2. Studies that only analyzed risk factors without constructing a complete ML model;
  3. Studies lacking all of the following outcome measures used to assess the accuracy of an ML model: ROC curve, C-statistic, C-index, sensitivity, specificity, accuracy, recall, precision, confusion matrix, 2 × 2 diagnostic table, F1 score, and calibration curve;
  4. Studies on the accuracy of single-factor prediction.

Retrieval strategy

A systematic search was conducted up to November 27, 2022 in PubMed, Embase.com, the Cochrane Library, and Web of Science. The search combined subject headings and free-text words; the following terms (including synonyms and closely related words) were used as index terms or free-text words: "traumatic brain injuries" and "machine learning". Only peer-reviewed articles were included. The complete search strategy for all databases is provided in Table S1.

Literature screening and data extraction

We imported the retrieved records into EndNote, removed duplicates, screened potentially eligible original studies by title and abstract, downloaded the full texts of candidate studies, and determined final eligibility by full-text review. Before data extraction, a standard data extraction spreadsheet was prepared. The extracted data comprised the title, authors, year of publication, country of the authors, study type, patient source, setting in which the TBI occurred, time of death, number of deaths, total sample size, number of deaths in the training set, total sample size of the training set, method of generating the verification set, number of deaths in the verification set, sample size of the verification set, handling of missing values, variable selection method, type of model used, and modeling variables. Literature screening and data extraction were performed by two independent investigators (JW and MJY); disagreements, if any, were resolved by a third investigator (HCW).

Risk of bias assessment

Two investigators used the Prediction model Risk Of Bias ASsessment Tool (PROBAST) to assess the risk of bias (ROB) of each model in each study and its applicability to our review questions [13]. Disagreements, if any, were resolved by a third investigator. Based on a series of signaling questions, each model was rated as having a "high", "unclear", or "low" ROB in four domains (participants, predictors, outcomes, and analysis). The same tool was used to assess the applicability of each model to our review questions in three domains (participants, predictors, and outcomes).

Outcome measures

To quantitatively summarize the prediction performance of the ML models, we extracted discrimination and calibration measures. Discrimination quantifies a model's capacity to distinguish patients who develop the outcome from those who do not. The C-index or the area under the receiver operating characteristic curve (AUROC), with its 95% confidence interval (95% CI), was extracted. When the number of death events in an original study was very small, the C-index alone could not reflect the accuracy of a model in predicting deaths, so our outcome measures also included sensitivity and specificity. If these were not reported but an AUC value was available, sensitivity and specificity were obtained from the ROC curve at the point of the optimal Youden index.
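The Youden-index extraction described above can be illustrated with a short sketch. This is not the extraction code used in the review (which is not specified beyond the Youden criterion); the function name and the digitized ROC operating points are hypothetical.

```python
# Hedged sketch: read sensitivity and specificity off a digitized ROC curve
# at the point maximizing the Youden index J = sensitivity + specificity - 1.
# Function name and example points are illustrative, not from the review.

def optimal_youden(fpr, tpr):
    """Return (sensitivity, specificity, J) at the ROC point maximizing J."""
    best = max(range(len(tpr)), key=lambda i: tpr[i] - fpr[i])
    sensitivity = tpr[best]
    specificity = 1.0 - fpr[best]
    return sensitivity, specificity, sensitivity + specificity - 1.0

# Hypothetical digitized operating points (false-positive rate, true-positive rate):
fpr = [0.0, 0.1, 0.2, 0.4, 1.0]
tpr = [0.0, 0.6, 0.8, 0.9, 1.0]
sens, spec, j = optimal_youden(fpr, tpr)  # sens = 0.8, spec = 0.8
```

In practice the operating points would come from digitizing a published ROC figure, so the recovered sensitivity and specificity inherit the digitization error.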

Statistical methods

Meta-analysis was performed on the C-index and accuracy of the ML models. If a C-index was reported without a 95% CI or standard error, we estimated the standard error following Debray TP et al. [14]. Given the differences in the variables included in the ML models and the inconsistency of their parameters, a random-effects model was preferred for the meta-analysis of the C-index. A bivariate mixed-effects model was adopted for the meta-analysis of sensitivity and specificity. The meta-analysis was conducted in R 4.2.0 (R Development Core Team, Vienna, http://www.R-project.org).
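As an illustration of the pooling step, a random-effects meta-analysis can be sketched with the DerSimonian-Laird estimator, recovering a study's standard error from its reported 95% CI when the SE itself is absent. This is a simplified Python sketch, not the R code actually used; note that Debray et al. additionally recommend pooling the C-index on the logit scale, which is omitted here for brevity, and all numbers below are hypothetical.

```python
import math

def se_from_ci(lower, upper, z=1.96):
    # Fallback when a study reports only a 95% CI: SE ≈ CI width / (2 * 1.96).
    return (upper - lower) / (2 * z)

def dersimonian_laird(effects, ses):
    """Random-effects pooled estimate with a 95% CI (DerSimonian-Laird)."""
    w = [1.0 / s**2 for s in ses]                                # fixed-effect weights
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))  # Cochran's Q
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)                # between-study variance
    w_re = [1.0 / (s**2 + tau2) for s in ses]                    # random-effects weights
    pooled = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se

# Hypothetical C-indexes; the second study's SE is recovered from its 95% CI:
pooled, lo, hi = dersimonian_laird(
    [0.86, 0.82, 0.91],
    [0.02, se_from_ci(0.76, 0.88), 0.03],
)
```

The between-study variance tau2 is what distinguishes this from a fixed-effect pooling: it widens the weights (and the pooled CI) when studies disagree more than sampling error alone would explain.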

Results

Literature screening results

The study selection process is shown in Fig. 1. A total of 3,701 unique records were identified, and 76 reports were reviewed in full. After excluding studies without available full texts, conference abstracts, registration protocols, studies providing no outcome measure of ML accuracy, and studies that did not develop a new ML model, 47 studies were finally included [11, 12, 15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59].

Fig. 1 Literature screening process

Basic characteristics of the included literature

The 47 included studies were from Asia (N = 17, 36%), Europe (N = 16, 34%), the Americas (N = 12, 26%), and the Middle East (N = 2, 4%). The total number of subjects was 2,080,819. The study types comprised case-control studies (N = 32), prospective cohort studies (N = 10), and retrospective cohort studies (N = 4). Most articles were published in 2022 (N = 10), followed by 2020 (N = 7) and 2018 (N = 6). In total, 156 models were included, of which 122 were newly developed ML models: 98 for predicting in-hospital mortality and 24 for predicting out-of-hospital mortality. The ML models for predicting in-hospital mortality mainly comprised the SVM (N = 16), DT (N = 13), LR (N = 12), random forest (RF) (N = 10), K-nearest neighbors (KNN) (N = 8), and NN (N = 8).

Predictors

In the newly developed ML models, the most common variables were the GCS score (N = 88; 9.6%), age (N = 74; 8.1%), CT classification (N = 72; 7.9%), pupil size/light reflex (N = 31; 3.4%), glucose (N = 30; 3.3%), systolic blood pressure (N = 30; 3.3%), hypotension (N = 15; 1.6%), heart rate (N = 15; 1.6%), and hypoxemia (N = 14; 1.5%). The specific predictors are presented in Table S2.

Risk of bias assessment result

Most study subjects came from case-control studies, and the ROB was low in the participants domain. In the predictors domain, there was a high ROB concerning whether predictors were assessed blinded to outcome data (N = 83; 68%). Because most included studies were case-control studies, with only a small number of prospective and retrospective cohort studies, the ROB in the outcomes domain was low or unclear. In the analysis domain, there was a high ROB (N = 67) regarding whether the sample size was reasonable, mainly caused by the lack of an independent verification cohort or a verification sample size below 100. A small number of models showed a high ROB regarding whether selection of predictors by single-factor analysis was avoided, while the rest were at a low or unclear ROB (Fig. 2).

Fig. 2 ROB assessment result

Meta-analysis results

In-hospital mortality

Newly-developed machine learning models

The pooled C-index of the 98 ML prediction models for predicting in-hospital mortality was 0.86 (95% CI: 0.84, 0.87) in the training set (Fig. 3). Sensitivity and specificity were reported for 64 models; the pooled sensitivity and specificity were 0.79 (95% CI: 0.75, 0.82) and 0.89 (95% CI: 0.86, 0.92), respectively. For in-hospital mortality, the verification set only included the LR (N = 3) [C-index: 0.86 (95% CI: 0.81, 0.90); sensitivity: 0.59 to 0.66; specificity: 0.92 to 0.93]. The most common algorithms among the ML prediction models were the SVM (N = 16) [C-index: 0.87 (95% CI: 0.83, 0.92); sensitivity: 0.86 (95% CI: 0.80, 0.91); specificity: 0.87 (95% CI: 0.74, 0.94)], the DT (N = 13) [C-index: 0.82 (95% CI: 0.76, 0.88); sensitivity: 0.83 (95% CI: 0.67, 0.92); specificity: 0.93 (95% CI: 0.78, 0.98)], the LR (N = 12) [C-index: 0.88 (95% CI: 0.83, 0.93); sensitivity: 0.68 (95% CI: 0.58, 0.77); specificity: 0.91 (95% CI: 0.85, 0.95)], the RF (N = 10) [C-index: 0.85 (95% CI: 0.81, 0.89); sensitivity: 0.86 (95% CI: 0.73, 0.93); specificity: 0.92 (95% CI: 0.76, 0.98)], the NN (N = 8) [C-index: 0.91 (95% CI: 0.88, 0.93); sensitivity: 0.78 (95% CI: 0.57, 0.91); specificity: 0.92 (95% CI: 0.87, 0.95)], and the KNN (N = 8) [C-index: 0.79 (95% CI: 0.70, 0.88)] (Tables 1 and 2).

Fig. 3 Forest plot of the C-index for in-hospital mortality predicted by newly developed ML models and clinically recommended tools

Table 1 C-index of prediction models for in-hospital and out-of-hospital mortality in the training set and verification set
Table 2 Sensitivity and specificity of prediction models for in-hospital and out-of-hospital mortality in the training set and verification set

Mature Clinically Recommended Tools

Sixteen clinical scoring tools, together with some tools externally validated in large-scale cohort studies, predicted in-hospital mortality, with a pooled C-index of 0.86 (95% CI: 0.85, 0.88) in the training set. Sensitivity and specificity were reported for 11 tools; the pooled sensitivity and specificity were 0.71 (95% CI: 0.64, 0.78) and 0.84 (95% CI: 0.78, 0.89), respectively. In the verification set, only 3 tools predicted in-hospital mortality, with a pooled C-index of 0.687 (95% CI: 0.525, 0.848); no sensitivity or specificity data were available (Figs. 3, 4 and 5a, b).

Fig. 4 Forest plot of the C-index for out-of-hospital mortality predicted by newly developed ML models and clinically recommended tools

Fig. 5 a Forest plots of sensitivity of newly developed ML models and clinically recommended tools to predict in-hospital death. b Forest plots of specificity of newly developed ML models and clinically recommended tools to predict in-hospital death

Out-of-hospital mortality

Newly developed machine learning models

In the training set, 24 ML prediction models predicted out-of-hospital mortality, with a pooled C-index of 0.83 (95% CI: 0.81, 0.85). Sensitivity and specificity were reported for 4 models; the pooled sensitivity and specificity were 0.74 (95% CI: 0.67, 0.81) and 0.75 (95% CI: 0.66, 0.82), respectively. In the verification set, 9 prediction models predicted out-of-hospital mortality, with a pooled C-index of 0.82 (95% CI: 0.81, 0.83); no sensitivity or specificity data were available. The most common algorithm among the ML models for out-of-hospital mortality was the LR (N = 10) [C-index: 0.84 (95% CI: 0.81, 0.88); sensitivity: 0.70 to 0.77; specificity: 0.61 to 0.75] (Tables 1 and 2).

Mature Clinically Recommended Tools

In the training set, 18 clinical scoring tools, together with some tools externally validated in large-scale cohort studies, predicted out-of-hospital mortality, with a pooled C-index of 0.817 (95% CI: 0.789, 0.845). Sensitivity and specificity were reported for 7 tools; the pooled sensitivity and specificity were 0.75 (95% CI: 0.68, 0.80) and 0.79 (95% CI: 0.73, 0.84), respectively. In the verification set, only 4 tools predicted out-of-hospital mortality, with a pooled C-index of 0.813 (95% CI: 0.789, 0.836); no sensitivity or specificity data were available.

Discussion

Main findings

Our study is the first to systematically review the performance of ML in predicting mortality in patients with TBI. According to the meta-analysis of 47 studies, ML models are highly accurate in predicting the mortality of patients with TBI: the C-index of most ML models, including the SVM, DT, LR, RF, and NN, exceeded 0.8, and that of the NN exceeded 0.9. Our analysis also included clinical scoring tools. The C-index of the GCS for predicting in-hospital mortality was 0.88 (95% CI: 0.85, 0.89), the sensitivity was 0.50 to 0.51, and the specificity was 0.90. Tools externally validated in large-scale cohort studies were also included, such as the CRASH core model [C-index: 0.92 (95% CI: 0.89, 0.96); sensitivity: 0.73; specificity: 0.90] and the IMPACT CT model [C-index: 0.88 (95% CI: 0.84, 0.91); sensitivity: 0.80; specificity: 0.78]. For in-hospital mortality, there was no large difference in overall prediction performance between ML and traditional scoring tools, although the NN and NB performed outstandingly. For out-of-hospital mortality, the prediction performance of ML was slightly superior to that of traditional clinical scores and of some tools externally validated in large cohort studies. Recently published studies indicate a trend for ML models to outperform LR, which may be related to recent improvements in ML algorithms and the increasing availability of specialized statistical software. Clinical decision support systems should be continually updated and improved to meet the high requirements for risk prediction models in clinical practice, and to further enable multidisciplinary collaborative decision-making as well as the benchmarking and monitoring of outcomes.

However, for predicting out-of-hospital mortality, LR remains the most frequently used model. It is easy to understand and implement owing to its simplicity, requiring no complex mathematical knowledge. In addition, LR interprets results well: the influence of each variable on the outcome can be read from its coefficient, making the model easy to analyze and understand. Furthermore, LR is easy to implement, because most programming languages provide corresponding libraries and tools.

Clinical practice findings

In clinical practice, the initial assessment and treatment of acute TBI still depend on the GCS. The GCS remains the most widely used scoring system for assessing TBI severity because it is fast and direct. Despite its utility, the GCS has important limitations and cannot provide sufficient information, especially for diagnosing mild subclinical TBI and predicting its trajectory. In mild subclinical TBI, patients may have no obvious neurological symptoms, yet the GCS determines the score only by evaluating eye, verbal, and motor responses. It may therefore be insufficiently sensitive and fail to detect mild TBI, especially mild cognitive and emotional disorders, which can affect patients' quality of life and capacity to work. In addition, the GCS cannot predict recovery and long-term prognosis: although it provides an estimate of the severity of brain injury, predicting recovery and prognosis from the GCS score alone remains inaccurate [60]. Although the GCS can be used to predict in-hospital mortality, its objectivity and accuracy are still limited; for instance, the verbal component cannot be accurately assessed in intubated patients. In such cases, some doctors assign the lowest possible score, while others infer the verbal response from other neurological manifestations [61]. Besides, the GCS cannot detect subtle clinical changes in unconscious patients, such as abnormal brainstem reflexes and abnormal limb postures [62]. We therefore also considered the Full Outline of UnResponsiveness (FOUR) score for predicting mortality in TBI. It is a newer coma scale developed to address the limitations of the GCS and comprises eye response (E), motor response (M), brainstem reflexes (B), and respiration (R). Compared with the GCS, the FOUR score has similar mortality prediction ability and appears usable as an alternative for predicting early mortality of TBI patients in the ICU [63]. It has some advantages; for example, in intubated patients all components of the FOUR score can be rated, whereas the GCS cannot be fully scored [64]. However, its advantages are not decisive: the evidence shows that the GCS and FOUR scores are of equal value in predicting in-hospital mortality and adverse outcomes, and their similar performance in the evaluation of TBI patients allows medical staff to choose either according to the situation at hand [65]. Notably, we found no comparative study between ML and the FOUR score in TBI.

In contrast, ML models are trained on large amounts of real-world data, offer wider coverage, incorporate more variables, and can reveal latent correlations and rules. The development of clinical scoring tools is inevitably subject to investigators' subjective influence, so some variables may be ignored or over-weighted. Additionally, ML models can make accurate and detailed predictions based on the manifestations and characteristics of individual cases, avoiding the overly vague and generalized results that clinical scoring tools can produce. Furthermore, ML models can not only learn and optimize their prediction algorithms adaptively, improving accuracy, but can also support personalized prediction and treatment planning based on patients' characteristics and historical data [66]. Clinical scoring tools, by contrast, tend to recommend templated, standardized therapeutic plans, which cannot satisfy the needs of different patients. Finally, ML prediction models can process various types of data, including imaging, laboratory, and biochemical information, helping doctors comprehensively evaluate a patient's condition, whereas clinical scoring tools usually focus on only a few important indexes, so some implicit influencing factors may be ignored.

ML models used for medical purposes each have their own specificity and applicability, and different diseases and clinical problems require different models [67, 68]. However, owing to the lack of large-sample studies and the absence of a clear consensus, selecting a suitable model is difficult. It is therefore necessary to comprehensively compare different algorithms and hyperparameters to find the optimal model [69], while also considering precision, complexity, and training time. Finally, the performance of the selected model can be assessed by techniques such as cross-validation, to ensure its efficiency in practical application [70]. ML algorithms make assumptions about the relationship between predictors and target variables, known as the inductive bias, which introduces bias into the model; these assumptions make some algorithms more suitable for certain datasets than others [71]. Therefore, the degree of improvement offered by ML and its clinical impact remain uncertain. Nonetheless, it is certain that ML-based big-data analysis is highly effective for absorbing and assessing massive, complex medical and healthcare data. In addition, among the variables we tallied, the GCS score, age, CT classification, pupil findings, and glucose ranked highest. That the GCS, itself a clinical scoring tool, has become a variable in ML models is not beyond our expectations, because it contains key prognostic indexes such as verbal response, consciousness, and motor response [72]. Interestingly, we found that CT classification is significant for predicting the prognosis of TBI, which conforms to the clinical use of CT imaging to intuitively assess TBI, predict prognosis, help determine therapeutic strategies, and inform rehabilitation planning [73].

However, CT imaging generates a huge amount of data: hundreds to thousands of images can be obtained per scan, covering many different tissues and structures, and artifacts may degrade image quality. It has therefore been difficult to make progress in traditional studies linking CT data to clinical disease course. ML, by contrast, can automatically process large volumes of clinical CT imaging data and extract useful features without manual intervention, significantly lowering the time and costs required for processing and analysis and improving work efficiency. ML algorithms can rapidly and accurately identify and localize key organs, structures, and anomalies, helping doctors make more accurate and scientific medical decisions; they can also support personalized therapeutic plans, enabling more precise treatment and better outcomes [74]. In the future, studies on ML and CT imaging will deepen, and ML will be applied more widely. Larger cohort studies are required to support its clinical application, and interdisciplinary cooperation among experts from medicine, computer science, and mathematics will become increasingly important to accelerate the application and development of ML in CT imaging.

A strength of our study is its comprehensive retrieval strategy and thorough analysis. We included all models for predicting TBI mortality: not only ML models but also traditional scoring tools and tools externally validated in large cohort studies. The most important advantage of ML algorithms is their ability to handle missing data more effectively and to implement more complex calculations. ML can perform non-linear prediction of outcomes without relying on distributional assumptions such as independence of observations or absence of multicollinearity. Owing to its outstanding pattern-recognition capacity, ML can help clinicians make decisions, and it has been demonstrated to perform well in identifying endocrine diseases [75] and predicting mortality from inflammatory diseases [76].

Our study also has certain limitations. Firstly, it only included articles written in English, so relevant literature in other languages may not have been fully considered. Secondly, data conversion was used during data extraction, which may introduce certain deviations, and the ROC curve, sensitivity, and specificity were not reported in some of the included studies. Finally, although the meta-analysis summarized the characteristics and trends of the relevant literature, more refined explorations are still required, for instance of the capacity of ML to predict TBI mortality in different patient groups and regions, which would provide a clearer understanding of the practical feasibility of ML in predicting TBI mortality.

Conclusion

According to this systematic review and meta-analysis, ML models can predict the mortality of TBI with excellent discriminative ability in case-control studies. The key factors related to model performance include the accepted clinical variables of TBI and the use of CT imaging. Although a single model is often superior to traditional scoring tools, the assessment of summary performance is restricted by the heterogeneity of the studies. It is therefore necessary to develop a standardized reporting guideline on ML for TBI. Currently, ML models are rarely used clinically, so there is an urgent need to determine their clinical impact in different patient groups, to ensure their generalizability and translate the theoretical knowledge into clinical practice.