Introduction

Early treatment of stroke with large vessel occlusion (LVO) requires rapid transport to a thrombectomy-capable hospital for early recanalization1,2. Similarly, timely transport to specialized facilities is critical for treating subarachnoid haemorrhage (SAH) and intracerebral haemorrhage (ICH), as it can significantly improve patient outcomes3,4. The American Stroke Association recommends recognizing stroke accurately, activating emergency medical services (EMS), triaging to the appropriate hospital, and designating a competent stroke centre5. However, accurately diagnosing stroke in prehospital settings can be challenging for EMS personnel because of resource constraints and the similarity of symptoms among different types of stroke, potentially leading to delays in hospital arrival or misdiagnosis. To address this issue, various prehospital scales targeting LVO have been developed to aid in determining indications for thrombectomy6,7,8,9, and the prehospital diagnosis of SAH and ICH is also important for acute stroke treatment10,11. Moreover, recent advances in machine learning (ML) and deep learning (DL) models have shown promising results in improving prehospital diagnosis scales12,13,14.

In the medical field, ML and DL models have demonstrated their potential in computer-aided diagnosis, helping healthcare professionals make accurate and timely diagnoses. For instance, DL-based image classification has been used to diagnose diseases such as pneumonia, breast cancer and lung cancer15,16,17. ML-driven segmentation of medical images has enabled the detection of regions of interest18, and feature extraction techniques have been used to extract relevant features from medical images or other data for developing diagnostic tools19,20. ML models have also been developed as decision support systems to assist clinicians in diagnosing and treating diseases such as heart disease21. Three methods are generally used to optimize hyperparameters and thereby improve the accuracy of ML models: grid search, random search, and Bayesian optimization. Grid search is simple and easy to implement; however, it is computationally expensive when the hyperparameter space is large, and it does not learn adaptively from previous iterations. Random search explores the hyperparameter space more efficiently than grid search, but because it samples at random, there is no guarantee that the best hyperparameter combination will be found. Bayesian optimization explores the hyperparameter space efficiently by using a probabilistic model to guide the search and adapting the search on the basis of previous evaluations. It uses an acquisition function to select the next hyperparameter configuration to evaluate, balancing exploration and exploitation.
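The difference between the first two search strategies can be illustrated with a minimal sketch. The objective `validation_score`, its two hyperparameters, and their ranges are hypothetical stand-ins for a cross-validated model score and are not taken from this study; Bayesian optimization is not implemented here because it additionally requires a probabilistic surrogate model.

```python
import itertools
import random


def validation_score(learning_rate, max_depth):
    """Hypothetical stand-in for a cross-validated score (peak at 0.1 and 6)."""
    return 1.0 - (learning_rate - 0.1) ** 2 - 0.01 * (max_depth - 6) ** 2


def grid_search(rates, depths):
    """Evaluate every combination on a fixed grid and keep the best one."""
    return max(itertools.product(rates, depths),
               key=lambda params: validation_score(*params))


def random_search(n_trials, seed=0):
    """Evaluate n_trials configurations drawn at random from the same ranges."""
    rng = random.Random(seed)
    trials = [(rng.uniform(0.01, 0.3), rng.randint(2, 10))
              for _ in range(n_trials)]
    return max(trials, key=lambda params: validation_score(*params))


# Both strategies receive the same budget of nine evaluations.
grid_best = grid_search([0.01, 0.1, 0.3], [2, 6, 10])
random_best = random_search(n_trials=9)
```

The grid can only return points that were placed on it in advance, whereas random search may land anywhere in the range; neither uses the scores of earlier trials to choose the next one, which is the gap that Bayesian optimization addresses.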

Building upon these recent advances, our study aimed to develop an ML-driven decision support system to help EMS personnel diagnose patients with consistent accuracy. We collected stroke-related information from the records of patients with suspected stroke who were transported by EMS to a single secondary medical care area as part of the Smart119 project. In our previous paper, we analysed the data using ML models and presented a stroke prediction scale that includes the diagnosis of stroke categories22. In this work, considering that the prehospital selection of patients requiring surgical treatment, rather than the diagnosis of stroke subtypes, would contribute to more appropriate transport, we examined the prehospital predictive diagnosis of patients who actually required surgical intervention based on case data from the Smart119 project. Our findings suggest that by integrating ML into prehospital decision support for EMS personnel, it is possible to improve patient outcomes by enabling appropriate and timely transport of patients requiring stroke surgical treatment.

Results

Baseline characteristics and outcomes

Patient characteristics and clinical findings in the study model are shown in Tables 1, 2, 3, and S1–S3. There was no significant difference in patient background between the two groups; however, the intervention group had a significantly shorter time from onset to command (Table 1). In terms of the level of consciousness, treatment intervention was less common in alert patients (Japan Coma Scale [JCS] 0, Glasgow Coma Scale [GCS] E4V5M6) and significantly more common in patients with JCS 3–100 and GCS E3/V2/M4-5 (Table 2). The findings associated with considerably more frequent intervention were sudden headache, vomiting, hemiparesis, conjugate deviation, aphasia, and dysarthria (Table 3).

Table 1 Baseline characteristics in the training cohort.
Table 2 Level of consciousness in the training cohort.
Table 3 Vital signs and symptoms in the training cohort.

Prediction of prehospital stroke surgical intervention

Four popular ML algorithms were used to predict the need for stroke surgical intervention: eXtreme Gradient Boosting (XGBoost), Logistic Regression, Random Forest, and Support Vector Machine (SVM), representing a gradient boosting algorithm, a linear algorithm, a tree algorithm, and a dimensionality reducer and classifier, respectively (Table S4). In the training cohort, analysis using Random Forest predicted surgical intervention in stroke patients with high performance (an area under the receiver operating characteristic curve [AUROC] of 0.882, a sensitivity of 0.862, and a specificity of 0.746). When applied to the test cohort, the XGBoost model performed the best and predicted surgical intervention with higher scores than the other models, achieving an AUROC of 0.802 (sensitivity 0.719, specificity 0.774) (Table 4, Fig. 1). The SHapley Additive exPlanation (SHAP) summary plot revealed that the major predictive contributors for stroke intervention were “Japan Coma Scale”, “dysarthria”, “heart rate”, “age”, “sudden headache and/or unconsciousness”, “Glasgow coma scale (V)”, “time from onset to emergency call”, “body temperature”, “aphasia”, and “oxygen saturation” (Fig. 2).

Table 4 Prehospital stroke prediction for intervention using XGBoost.
Figure 1

Area under the receiver operating characteristic curves of machine learning models. The receiver operating characteristic curve of prehospital prediction algorithms for stroke requiring surgical intervention is depicted with 1-specificity on the x-axis and sensitivity on the y-axis using the training cohort (a) and the test cohort (b). The 95% confidence interval of the AUROC is also shown. AUROC area under the receiver operating characteristic curve.

Figure 2

SHAP value of stroke surgical intervention. The impact of the features on the model output was expressed as the SHAP value. The features are placed in descending order according to their importance. The association between the feature value and SHAP value indicates a positive or negative impact of the predictors. The extent of the value is depicted as red (high) or blue (low) plots. SHAP SHapley Additive exPlanation.

Discussion

The present study demonstrated that a prehospital scale could predict stroke requiring surgical intervention with high accuracy. Although prehospital stroke diagnostic scales have been published in many countries and scales have been developed to determine the severity of stroke23,24, to the best of our knowledge, this is the first scale that predicts the need for surgical intervention. When surgical intervention is needed for any type of stroke, rapid transport is necessary, and hospitals need to be prepared for this. Therefore, this scale, which can predict the need for surgical intervention before hospital arrival, is very useful for rapid patient transport.

The most important variables for diagnosis were found to be JCS, vitals (pulse, temperature, oxygen saturation), age, time from onset to emergency call, headache, and speech abnormalities. Interestingly, a detailed neurological examination was not among them (Fig. 3). The most important variables are simple survey items, and we believe that the scale is composed of easily obtainable data that EMS teams routinely observe. The absence of detailed neurological findings such as paresis suggests that the focus should be on other items that reflect disease severity, since paresis and neurological deficits are observed even in minor strokes that do not require surgical intervention.

Figure 3

Feature importance. The impact of the features on the model output is expressed as the average of the absolute SHAP value. The larger the value, the more important is the feature for predicting stroke surgical intervention. SHAP SHapley Additive exPlanation.

The prehospital stroke scales published to date have enabled the diagnosis of LVO and stroke subtypes with high accuracy. However, in all of these scales, stroke-specific survey items such as conjugate deviation and hemispatial neglect were important25,26,27. As noted above, these items were not among the key variables in this scale, suggesting that its usefulness could be maintained even when these items were missing, such as in cases in which stroke was not suspected.

The novel prehospital scale developed in this study can predict the need for surgical intervention across all stroke diseases. In comparison, the Shonan Prehospital Scale (SPSS) is a score used at the municipal level to predict surgical intervention. The SPSS evaluates severe headache, impaired consciousness (JCS ≥ 10), and local symptoms (hemiplegia, facial paralysis, or abnormal speech), scoring 1 point if the onset is severe and 2 points if it is sudden onset. Comparative validation of the two models using the present data revealed that the newly developed model (test-cohort AUROC of 0.802) was superior to the SPSS, which achieved an AUROC of 0.652 (sensitivity 0.880, specificity 0.425) (Table S5).

The patient information system utilized in the Smart119 project stores patient data gathered by EMS personnel via tablets. The interface of the system is equipped with an application that enables prehospital stroke diagnosis. We believe that the inclusion of this program into the existing system would reduce the time required for hospital selection and contribute to prompt and appropriate emergency transport.

This study has some limitations. First, the decision to initiate therapeutic intervention was made by the neurosurgeons at each participating hospital, which may have introduced variability across institutions. Second, although this was a multicentre study, the study was limited to a single metropolitan area in Japan. Hence, it is crucial to validate the algorithm’s high predictive value in other regions with distinct characteristics to increase its applicability across Japan. Fortunately, the medical region where this study was conducted (Chiba Prefecture) comprises diverse types of medical organizations, including urban type with multiple hospitals, independent type with one hospital as the main hospital, and depopulated type with no central hospital. The algorithm will be expanded to Chiba Prefecture as a whole and will be demonstrated in the future.

In conclusion, our algorithm serves as a prehospital stroke scale that can be easily completed by EMS personnel to predict the need for surgical intervention in patients with stroke. We believe that our ML-based scale holds significant value because predicting the need for stroke intervention is important for determining a suitable transport destination within the local medical care system.

Methods

Study design and patient population

From September 2019 to January 2022, we conducted a study of patients who were transported by EMS for suspected stroke. The destination hospitals included all 12 medical institutions within the secondary care area that were equipped to receive stroke patients. We developed a surgical intervention prediction scale by retrospectively examining 1143 patients whose diagnosis and treatment plan could be ascertained at the transport destination.

Surgical intervention was defined as aneurysmal neck clipping or coil embolization for SAH, haematoma removal, haematoma or ventricular drainage for ICH, administration of intravenous tissue plasminogen activator (tPA), mechanical thrombectomy, or other endovascular treatment for acute ischaemic stroke. The decision to perform interventions was made at the discretion of the neurosurgeon at each institution.

The Chiba University Hospital Certified Clinical Research Review Board approved this study (No. 2733) and waived the need for written informed consent in conformity with the Ethical Guidelines for Medical and Health Research Involving Human Subjects in Japan. We posted information about this study in each ambulance. We promptly excluded the collected data when a patient or family indicated that they did not wish to participate in this study.

Selected variables

The survey items for analysis included patients’ characteristics, vital signs, symptoms, level of consciousness, and the 7 key parameters proposed by the Japanese Stroke Association. Details are as follows: (i) patients’ characteristics: age, sex, time from onset to emergency call, onset timing; (ii) vital signs: pulse, blood pressure (systolic/diastolic), body temperature, oxygen saturation; (iii) symptoms: vomiting, dizziness, cramps, numbness; (iv) level of consciousness: JCS, GCS (E, V, M); (v) previous medical history; (vi) important stroke parameters: conjugate deviation, hemispatial neglect by 4-finger method25, aphasia (call of glasses/clock), pulse irregularity, dysarthria, facial paralysis, upper and lower hemiparesis.

Missing values

As our data had missing values, we performed imputation before building the ML models. First, we used domain knowledge to impute values within groups of related features, including (i) conjugate deviation and visual field defects; (ii) dysarthria and facial paralysis; (iii) aphasia, GCS, JCS and other consciousness-related features; (iv) systolic and diastolic blood pressure values; and (v) paralysis-related features. For other numerical features, such as heart rate, body temperature, oxygen saturation, and time from onset to emergency call, we imputed the median value of each feature. The remaining features with missing values, all of which were categorical, were left as they were since boosting models such as XGBoost support missing values and treat them as a separate category.
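The median imputation of numerical features can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: the function `impute_median`, the dictionary-based record layout, and the three example records are hypothetical, and the domain-knowledge imputation of the feature groups (i) to (v) is not shown.

```python
from statistics import median


def impute_median(rows, numeric_keys):
    """Return copies of rows with missing (None) numeric values replaced by
    the median of the observed values of the same feature."""
    medians = {
        key: median(row[key] for row in rows if row[key] is not None)
        for key in numeric_keys
    }
    return [
        {key: (medians[key] if key in medians and value is None else value)
         for key, value in row.items()}
        for row in rows
    ]


# Hypothetical records; "vomiting" is categorical and is left untouched.
patients = [
    {"heart_rate": 72, "body_temperature": 36.5, "vomiting": "yes"},
    {"heart_rate": None, "body_temperature": 37.1, "vomiting": None},
    {"heart_rate": 90, "body_temperature": None, "vomiting": "no"},
]
imputed = impute_median(patients, ["heart_rate", "body_temperature"])
```

Missing categorical values remain `None` in this sketch, mirroring the decision to leave them for the boosting model to handle.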

Machine learning model development

We developed ML models using four different algorithms: XGBoost, Random Forest, Logistic Regression and SVM. To ensure a balanced distribution of surgical intervention categories, we randomly assigned 765 cases (70%) to a training cohort and 378 cases (30%) to a test cohort. The stroke types were classified into SAH, ICH, LVO, and other ischaemic stroke in both cohorts. The number of cases and the number of surgical interventions for each type are shown in Fig. S1.
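One way to obtain cohorts with a balanced distribution of the outcome is a stratified random split, sketched below. Whether the study used exactly this procedure is an assumption; the function `stratified_split`, the seed, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split case indices so that each label keeps roughly the same share
    in the training and test cohorts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                      # random assignment within a label
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Hypothetical outcome labels: 1 = surgical intervention, 0 = no intervention.
labels = [1] * 30 + [0] * 70
train_idx, test_idx = stratified_split(labels)
```

Because the shuffle happens separately within each label, the proportion of intervention cases is preserved in both cohorts regardless of the random seed.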

The hyperparameters of the ML models were tuned using Optuna, an open-source hyperparameter optimization framework that employs Bayesian optimization. Optuna searches for the combination of hyperparameters that maximizes the model score by repeatedly proposing hyperparameter values and evaluating the models obtained with them. In each iteration, the model was evaluated by the AUROC obtained through fivefold cross-validation.
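The evaluation performed in each iteration, a fivefold cross-validated AUROC for one candidate configuration, can be sketched in plain Python. Everything here is an illustrative assumption rather than the study's pipeline: a one-feature nearest-neighbour scorer stands in for the real models, its neighbourhood size `k` stands in for the tuned hyperparameters, a fixed candidate list replaces Optuna's sampler, and the data are synthetic.

```python
import random
from statistics import mean


def auroc(labels, scores):
    """Probability that a random positive case is scored above a random
    negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def knn_score(train, x, k):
    """Fraction of positives among the k training points nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return mean([y for _, y in nearest])


def cv_auroc(data, k, n_folds=5):
    """Mean AUROC over n_folds held-out folds for neighbourhood size k."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    fold_scores = []
    for i, held_out in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores = [knn_score(train, x, k) for x, _ in held_out]
        fold_scores.append(auroc([y for _, y in held_out], scores))
    return mean(fold_scores)


# Synthetic data: alternating labels, so every fold contains both classes.
rng = random.Random(0)
data = [(rng.gauss(1.0 if y else 0.0, 1.0), y) for y in [0, 1] * 50]

# Score each candidate value and keep the one with the highest mean AUROC.
results = {k: cv_auroc(data, k) for k in (1, 5, 15)}
best_k = max(results, key=results.get)
```

A Bayesian optimizer would replace the fixed candidate list with proposals informed by the scores already observed, while the cross-validated scoring step stays the same.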

Statistical analysis

Model performance was measured in terms of the AUROC, sensitivity, specificity, and F1 score. Furthermore, the SHAP algorithm was applied to the XGBoost model, which outperformed all other models, to interpret the contribution of each variable to the predictive model28. In this algorithm, the SHAP values are calculated by measuring the difference in model output resulting from the inclusion of a variable in the algorithm, providing insights into the impact of each variable on the output. In the SHAP plots, a violin plot was created for all data points associated with each feature, with higher feature values appearing red and lower feature values appearing blue. The violin plot is aligned with the SHAP value as the x-axis. Thus, red (blue) points on the right, i.e., at higher positive SHAP values, indicate that higher (lower) values of that feature push the model output towards a positive prediction.
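The idea of attributing the model output to each variable by including it and measuring the change can be made concrete with an exact Shapley value computation on a toy model. This brute-force sketch averages over all feature orderings and is feasible only for a handful of features; it is not the algorithm the SHAP package uses for tree models. The function `toy_model`, its coefficients, and the example values are hypothetical.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values: the average marginal contribution of each
    feature over all orderings in which features can be included."""
    names = list(instance)
    totals = dict.fromkeys(names, 0.0)
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)            # start with every feature "absent"
        previous = model(current)
        for name in order:
            current[name] = instance[name]  # include this feature
            output = model(current)
            totals[name] += output - previous
            previous = output
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(x):
    """Hypothetical risk score with an interaction term (not the study's model)."""
    return 0.5 * x["jcs"] + 0.25 * x["dysarthria"] + 0.125 * x["jcs"] * x["dysarthria"]


patient = {"jcs": 2, "dysarthria": 1}
reference = {"jcs": 0, "dysarthria": 0}
phi = shapley_values(toy_model, patient, reference)
```

The contributions sum to the difference between the model output for the patient and for the reference input, which is the additivity property that gives the method its name.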

Continuous values were expressed as medians (interquartile ranges), and categorical values were presented as absolute numbers and percentages. Two-sided P values less than 0.05 were considered indicative of statistical significance.
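For reference, the threshold-based performance measures named in this section follow directly from the confusion matrix, as in the sketch below. The function `threshold_metrics`, the 0.5 cut-off, and the example labels and probabilities are hypothetical; the sketch assumes that both classes are present and that at least one case is predicted positive.

```python
def threshold_metrics(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity, and F1 score after thresholding probabilities."""
    y_pred = [int(p >= threshold) for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)   # true positives
    tn = sum(t == 0 and p == 0 for t, p in pairs)   # true negatives
    fp = sum(t == 0 and p == 1 for t, p in pairs)   # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)   # false negatives
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


# Hypothetical outcomes (1 = intervention) and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1]
metrics = threshold_metrics(y_true, y_prob)
```

Unlike the AUROC, these three measures depend on the chosen cut-off, so a reported sensitivity and specificity pair describes one operating point on the receiver operating characteristic curve.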

Analyses were performed using Python 3.7.15 with the open-source packages XGBoost 1.5.1, Scikit-learn 1.0.2, Pandas 1.3.5, Optuna 2.10.1, and SHAP 0.41.0 (Python licence: https://docs.python.org/3/license.html, XGBoost: https://github.com/dmlc/xgboost/blob/master/LICENSE, Scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/main/COPYING, Pandas: http://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html?highlight=license, Optuna: https://github.com/optuna/optuna/blob/master/LICENSE, SHAP: https://github.com/slundberg/shap/blob/master/LICENSE). All figures in this study were drawn using Matplotlib (3.2.2)20,21, a Python visualization package.