Background

Cerebrovascular disease (CD) is a leading cause of death and disability worldwide. The World Health Organization has reported that more than 6 million deaths can be attributed to CD each year [1]. In China, about 13 million people have suffered from stroke, a subtype of CD [2]. Although hypertension, a high-fat diet, smoking, and alcohol consumption are well-known risk factors for CD, evidence from epidemiological studies indicates that short-term environmental exposure, such as air pollution and extreme weather conditions, also has an important impact on the onset of CD, resulting in an increased risk of morbidity [3,4,5]. Additionally, toxicological studies have presented several credible biological mechanistic pathways for the negative health effects associated with air pollution [6,7,8]. For example, air pollution exposure may provoke platelet activation, leading to enhanced blood coagulation and thrombus formation [9].

The growing morbidity and high treatment cost of CD have placed a heavy burden on limited healthcare resources. Forecasting the daily number of hospital admissions (HAs) for CD is of practical significance for optimizing medical resources and protecting public health by providing an early-warning signal of impending incidence. Traditional regression methods for time series prediction, such as the gray model, the simple exponential smoothing (SES) model, and the autoregressive integrated moving average (ARIMA) model, have been widely applied in predicting healthcare service demand [10,11,12,13]. These traditional methods are easy to implement but have difficulty handling multi-factor effects and non-linear mappings; consequently, such studies seldom extract features from factors other than the historical demand series.

Machine learning (ML) methods can overcome these disadvantages of traditional regression models [14] and have been applied by a limited number of studies to forecast the demands for healthcare services associated with environmental exposure. For instance, Qiu et al. [15] found that the light gradient boosting machine model outperformed the other five ML models they tested in predicting the peak demand days for cardiovascular disease (CVD) admissions. The researchers also identified that meteorological conditions and air pollutants substantially contributed to prediction accuracy. Bibi et al. [16] used a backpropagation neural network model to predict emergency department visits and found that the model performance was remarkably improved after considering temperature, humidity, and air pollution. Kassomenos et al. [17] discovered that the use of Artificial Neural Network (ANN) resulted in a 15% increase in the coefficient of determination (R2) compared to the Generalized Linear Model (GLM) for forecasting HAs for CVDs.

In recent years, as an advanced branch of artificial intelligence, deep learning (DL) has attracted much attention in related fields owing to its strong ability to capture potentially complex relationships among variables [18]. Khatibi et al. [19] and Wang et al. [20] proposed novel predictive models based on convolutional neural networks and long short-term memory (LSTM) to predict HAs due to mental disorders and cardiopulmonary diseases, respectively.

Despite the widespread use of ML and DL models in predicting healthcare demand, these models have their disadvantages. Lightweight ML models have limited prediction capabilities, and each of them has specific predefined structures and assumptions. No single model can always be optimal in various application scenarios. DL models generally rely on massive amounts of training data and a relatively long time window for the input sequence. In this context, the stacking ensemble technique [21,22,23] can provide an effective solution to strike a compromise by combining the strengths of multiple lightweight ML models to achieve superior performance using limited amounts of samples. Additionally, most existing studies treated these models as “black boxes” and rarely provided explanations for them, which might reduce their acceptance by the medical community [24]. Thus, it is essential to increase the transparency of ML models in the medical domain.

In this study, we applied stacking ensemble learning based on heterogeneous lightweight ML models to forecast medical demands caused by CD considering short-term environmental exposure and explained the predictions by the SHapley Additive exPlanations (SHAP) method. The main contributions of this study can be summarized as follows:

  1. A stacking ensemble model was proposed to predict daily HAs for CD using the HAs data, air quality data, and meteorological data of the previous 6 days.

  2. A re-weighting method based on label distribution smoothing was integrated into the proposed model to address the label imbalance problem that broadly exists in healthcare data.

  3. A post-hoc interpretation of the prediction mechanism of our proposed model was provided from global and local perspectives, which is conducive to understanding the model and exploring the factors affecting HAs for CD.

Methods

Data collection and preprocessing

The daily counts of HAs due to CD and stroke from January 1, 2015 to December 31, 2018 were collected from the electronic hospitalization summary reports of all the tertiary and secondary hospitals in the urban areas of Chengdu, China (a total of 1461 observations). Patients under the age of 35 or with residential addresses outside the urban districts of Chengdu were not included in the count of daily HAs. All causes of HAs were coded using the International Classification of Diseases, 10th Revision (ICD-10), including HAs for CD (I60-I69) and stroke (I60-I64).

Hourly air pollutant concentrations measured at six monitoring stations in the urban areas of Chengdu were obtained from the China National Environmental Monitoring Center (http://www.cnemc.cn/). The 24 h average concentrations of particulate matter with a diameter less than 2.5 µm (PM2.5), particulate matter with a diameter less than 10 µm (PM10), coarse particulate matter with a diameter between 2.5 µm and 10 µm (PMC), sulfur dioxide (SO2), nitrogen dioxide (NO2), and carbon monoxide (CO), and the 8 h moving average concentration of ozone (O3), were calculated as their daily concentrations [25]. The air quality index (AQI), assessed using the air pollutants mentioned above, was also obtained. Because the missing rate of the ambient air quality data was low, at 2.73% (40/1461), we filled in missing values with linear interpolation, which has been reported to be an effective imputation method when the missing rate is low (e.g., < 5%) [26].
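The daily aggregation described above can be sketched with pandas. The hourly values below are synthetic and the column names are illustrative; the O3 statistic is computed as the daily maximum of the 8 h moving average, a common convention that is assumed here rather than stated in the text.

```python
import numpy as np
import pandas as pd

# Synthetic hourly concentrations for three days (columns are illustrative).
rng = pd.date_range("2015-01-01", periods=72, freq="h")
gen = np.random.default_rng(0)
hourly = pd.DataFrame({"pm25": gen.uniform(20, 120, len(rng)),
                       "o3": gen.uniform(10, 180, len(rng))}, index=rng)

# 24 h average as the daily concentration for most pollutants.
daily = hourly[["pm25"]].resample("D").mean()

# Daily maximum of the 8 h moving average for O3.
daily["o3_8h"] = hourly["o3"].rolling(8).mean().resample("D").max()

# Linear interpolation to fill sparse missing values (missing rate < 5%).
daily = daily.interpolate(method="linear")
```

The same pattern extends to the remaining pollutants by adding their columns to the 24 h mean aggregation.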

Meteorological observations, including the daily average temperature (TEM) and relative humidity (RH), were derived from the Chengdu Meteorological Monitoring Database (http://data.cma.cn/).

Feature extraction and normalization

Based on our collected data, the HAs features, environmental features, and calendar features were extracted, as shown in Table 1.

Table 1 Feature descriptions

For the time series of HAs and environmental exposure, lag features have been broadly considered in epidemiological studies and HAs predictions [27, 28]. In our study, single-day lag features, namely the historical values on day x (x ∈ {1, 2, 3, …, L}) before prediction, and cumulative lag features, including the moving average and standard deviation of historical values during the previous 1 to L days, were extracted. L was set to 6 to represent the short-term effect of environmental exposure, as in most epidemiological studies [3, 4]. Among the calendar features, day of the week (DOW), month (MON), season (SEA), year (YEAR), and timestamp (TS) were used to depict the trends of HAs from the short to the long term. Holiday (HOL), workday (WD), first workday (FWD), and last workday (LWD) were extracted to represent the impact of the work-rest schedule in hospitals.
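The lag-feature construction can be sketched with pandas as follows. The series and feature names are illustrative, and the cumulative windows here start at 2 days since a 1-day moving average coincides with the 1-day lag.

```python
import pandas as pd

s = pd.Series(range(1, 21), name="hads", dtype=float)  # toy daily HAs series
L = 6
feats = pd.DataFrame(index=s.index)

# Single-day lag features: the value on day x before prediction, x = 1..L.
for x in range(1, L + 1):
    feats[f"hads_lag{x}"] = s.shift(x)

# Cumulative lag features: moving average and standard deviation of the
# values over the previous w days (windows end one day before prediction).
for w in range(2, L + 1):
    feats[f"hads_ma{w}"] = s.shift(1).rolling(w).mean()
    feats[f"hads_std{w}"] = s.shift(1).rolling(w).std()
```

Applying the same loop to each pollutant and meteorological series yields the full lag-feature block described above.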

We processed DOW, MON, SEA, and YEAR with One-Hot Encoding [29] and normalized the features using min–max normalization [30], as formulated in Eq. (1),

$${X}^{^{\prime}}=\frac{X-\mathit{min}(X)}{\mathit{max}(X)-\mathit{min}(X)}$$
(1)

Model construction

Stacking ensemble method

In this study, a stacking ensemble model comprised of four base learners and a meta learner was proposed to accurately predict the daily number of HAs for CD. As shown in Fig. 1, the development of the stacking model consists of two phases.

Fig. 1
figure 1

Schematic diagram of stacking model development

Initially, we split the whole dataset into a training set, which covered the period from January 1, 2015 to December 31, 2017, and a testing set, which covered the period from January 1, 2018 to December 31, 2018. We denoted them as \({\mathrm{S}}_{train} = ({\mathrm{X}}_{train}, {\mathrm{Y}}_{train})\) and \({\mathrm{S}}_{test} = ({\mathrm{X}}_{test}, {\mathrm{Y}}_{test})\), respectively, where \(\mathrm{X}\) represented the feature set and Y represented the corresponding label set.

In the first phase, the four base learners, including linear regression with L2 regularization (Ridge) [31], random forest (RF) [32], gradient boosting decision tree (GBDT) [33], and ANN [34], were trained and used to make predictions as the input features for the meta learner. To avoid overfitting and improve the generalization capability, five-fold cross-validation was implemented. We split \({\mathrm{S}}_{train}\) into five subsets in chronological order, and the ith subset was denoted as \({\mathrm{S}}_{i} = \left({\mathrm{X}}_{i}, {\mathrm{Y}}_{i}\right) (\mathrm{i}=1, 2, 3, 4, 5)\). At the ith fold, the jth base learner \((\mathrm{j}=1, 2, 3, 4)\) was trained on all subsets except Si and made predictions on Si, which were recorded as \({\mathrm{P}}_{j}({\mathrm{X}}_{i})\). Consequently, this process was repeated a total of 20 (4 × 5) times, with each base learner making predictions once on each fold. Afterwards, the predictions generated by the jth base learner throughout the five-fold cross-validation were represented as \({\mathrm{M}}_{j}({\mathrm{X}}_{train}) = \{{\mathrm{P}}_{j}({\mathrm{X}}_{1}), {\mathrm{P}}_{j}({\mathrm{X}}_{2}),{\mathrm{P}}_{j}({\mathrm{X}}_{3}), {\mathrm{P}}_{j}({\mathrm{X}}_{4}), {\mathrm{P}}_{j}({\mathrm{X}}_{5})\}\) and treated as a new feature in the new training set. Meanwhile, the jth base learner trained at the ith fold made predictions on \({\mathrm{S}}_{test}\), which were recorded as \({\mathrm{Q}}_{ji}\left({\mathrm{X}}_{test}\right)\), and the average of these predictions was calculated as a new feature in the new testing set, namely \({\mathrm{N}}_{j}\left({X}_{test}\right)=\frac{1}{5}{\sum }_{i=1}^{5}{Q}_{ji}({X}_{test})\).

Furthermore, to help the meta learner decide which model to apply under a certain circumstance [23], we merged the key features selected by the base learners, namely calendar features and HAs features (as shown in Additional file 1: Fig. S1), into the new training set and the new testing set, which were denoted as \({\mathrm{X}}_{train\_key}\) and \({\mathrm{X}}_{test\_key}\), respectively. Hence, at the end of the first phase, we gained a new training set \({\mathrm{S}}_{new\_train} = ({\mathrm{X}}_{new\_train}, {\mathrm{Y}}_{train})\), where \({\mathrm{X}}_{new\_train}=({\mathrm{M}}_{1}\left({\mathrm{X}}_{train}\right), {\mathrm{M}}_{2}\left({\mathrm{X}}_{train}\right),{\mathrm{M}}_{3}\left({\mathrm{X}}_{train}\right),{\mathrm{M}}_{4}\left({\mathrm{X}}_{train}\right),{\mathrm{X}}_{train\_key})\), and a new testing set \({\mathrm{S}}_{new\_test} = ({\mathrm{X}}_{new\_test}, {\mathrm{Y}}_{test})\), where \({\mathrm{X}}_{new\_test}=({\mathrm{N}}_{1}\left({\mathrm{X}}_{test}\right), {\mathrm{N}}_{2}\left({\mathrm{X}}_{test}\right), {\mathrm{N}}_{3}\left({\mathrm{X}}_{test}\right), {\mathrm{N}}_{4}\left({\mathrm{X}}_{test}\right), {\mathrm{X}}_{test\_key})\).

In the second phase, the meta learner, i.e., the elastic net [35], was trained on \({\mathrm{S}}_{new\_train}\) and then used to make the final predictions on \({\mathrm{S}}_{new\_test}\).
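The two-phase procedure above can be sketched with scikit-learn on a synthetic stand-in dataset. The data, model hyper-parameters, and shapes are illustrative; the real pipeline would additionally concatenate the key features to the stacked matrices before training the meta learner.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Ridge
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the HAs dataset (chronologically ordered samples).
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

base_learners = [Ridge(),
                 RandomForestRegressor(random_state=0),
                 GradientBoostingRegressor(random_state=0),
                 MLPRegressor(max_iter=1000, random_state=0)]

kf = KFold(n_splits=5, shuffle=False)  # five chronological folds
oof = np.zeros((len(X_train), len(base_learners)))       # M_j(X_train)
test_meta = np.zeros((len(X_test), len(base_learners)))  # N_j(X_test)

# Phase 1: out-of-fold predictions P_j(X_i) and averaged test predictions Q_ji.
for j, model in enumerate(base_learners):
    for train_idx, val_idx in kf.split(X_train):
        model.fit(X_train[train_idx], y_train[train_idx])
        oof[val_idx, j] = model.predict(X_train[val_idx])
        test_meta[:, j] += model.predict(X_test) / kf.n_splits

# Phase 2: elastic net meta learner on the stacked prediction features.
meta = ElasticNet(random_state=0).fit(oof, y_train)
final_pred = meta.predict(test_meta)
```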

To determine a suitable architecture for the stacking model, we tested eight widely used lightweight ML models in a preliminary experiment (see Additional file 1: Table S1), and Ridge, RF, GBDT, and ANN were selected as base learners for two reasons: 1) Each of them outperformed the other models in daily HAs prediction. 2) Ridge and ANN are classical linear and network ML models, respectively, while RF and GBDT are ensemble tree models based on the bagging and boosting methods, respectively. Because the theories behind these models are highly heterogeneous, they can gain insight into the training data from different perspectives and eventually increase the accuracy and robustness of the stacking model [36]. Linear regression with a combination of L1 and L2 regularization (elastic net) was selected as the meta learner because it has been widely used in similar contexts and can prevent overfitting to a large extent [21].

Re-weighting with label distribution smoothing (LDS)

In our study, the daily HAs data exhibit an imbalanced distribution, in which certain target values, especially peaks and troughs, have strikingly fewer observations. For classification tasks, re-sampling and re-weighting are the two main methods to address data imbalance. However, methods based on re-sampling, such as SMOTE [37] and SMOGN [38], are not directly applicable to our task, because they do not consider the distance between labels and might damage the intrinsic seasonal pattern of HAs. We therefore adopted the LDS method [39] to extend re-weighting schemes to regression tasks. It consists of the following steps: first, discretize the continuous target space into finite bins, yielding the empirical label density distribution; then, convolve the empirical label density with a symmetric kernel to obtain the effective label density, which accounts for the overlap in information between nearby labels, so that a cost-sensitive re-weighting method can be applied based on the effective label distribution.

To integrate this approach into our proposed model, we used the inverse of the effective label density as the weight of each training sample when calculating the loss function of the meta learner. Given that the HAs data show a yearly rising trend and an annual seasonal pattern, it is more reasonable to calculate the LDS-estimated label density within each year, as shown in Fig. 2.
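A minimal sketch of the LDS-based re-weighting; the bin count, Gaussian kernel width, and normalization below are illustrative choices, and the reference LDS implementation differs in details.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def lds_weights(labels, n_bins=30, sigma=2.0):
    """Per-sample weights via label distribution smoothing (LDS).

    1. Discretize the continuous targets into bins (empirical density).
    2. Convolve the empirical density with a symmetric Gaussian kernel
       to obtain the effective label density.
    3. Use the inverse effective density as the sample weight.
    """
    hist, edges = np.histogram(labels, bins=n_bins)
    eff_density = gaussian_filter1d(hist.astype(float), sigma=sigma)
    bin_idx = np.clip(np.digitize(labels, edges[1:-1]), 0, n_bins - 1)
    weights = 1.0 / np.maximum(eff_density[bin_idx], 1e-8)
    return weights * len(labels) / weights.sum()  # normalize to mean 1

# Imbalanced toy targets: common values near 70, rare peaks near 160.
gen = np.random.default_rng(0)
y = np.concatenate([gen.normal(70, 10, 950), gen.normal(160, 5, 50)])
w = lds_weights(y)
# w can then be passed as sample_weight when fitting the meta learner.
```

As intended, samples in the sparsely populated peak region receive larger weights than those near the bulk of the distribution.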

Fig. 2
figure 2

Schematic diagram of label distribution smoothing

Training details and parameters

In our experiment, the ANN was implemented using Keras with TensorFlow 2.4.1 as the backend. The other base learners were implemented with the Scikit-learn 0.24.2 Python library. The computation was performed on an AMD Ryzen 7 4800U CPU with Radeon Graphics (1.80 GHz). In the stacking model, the hyper-parameters of the base learners and the meta learner were tuned with the last 20% of the original training dataset and the last 20% of the new training dataset, respectively. Grid search was applied, and the best hyper-parameter combinations are listed in Additional file 1: Table S2.
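Holding out the last 20% of a chronologically ordered training set as the single validation fold for grid search can be expressed with scikit-learn's `PredefinedSplit`; the model and parameter grid below are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, PredefinedSplit

X, y = make_regression(n_samples=100, n_features=5, noise=1.0, random_state=0)

# -1 marks samples always in training; 0 marks the single validation fold,
# i.e., the last 20% of the chronologically ordered training set.
test_fold = np.where(np.arange(len(X)) < int(0.8 * len(X)), -1, 0)
cv = PredefinedSplit(test_fold)

grid = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                    cv=cv, scoring="neg_mean_absolute_error")
grid.fit(X, y)
best_alpha = grid.best_params_["alpha"]
```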

Model evaluation

To demonstrate the superiority of our stacking model, we compared it with all base learners and LSTM [20] on the testing set. In the LSTM, the input variables included only historical HAs and environmental features, and the time window of the input sequence was set to 6, consistent with the lag days of the other methods. The hyper-parameters of the benchmarks were also tuned with the last 20% of the original training dataset (shown in Additional file 1: Table S2).

Four metrics, including the mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), and coefficient of determination (R2), were used to evaluate the effectiveness of the prediction models, following Hu et al. [22].
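The four metrics can be computed as follows; the observed and predicted values are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([60.0, 75.0, 90.0, 70.0])  # observed daily HAs (toy values)
y_pred = np.array([58.0, 80.0, 85.0, 72.0])  # model predictions

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # in percent
r2 = r2_score(y_true, y_pred)
```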

Model explainability

To explain the predictions of our final model, we made use of the permutation explainer implemented in the SHAP Python library (version 0.39.0). SHAP [40] is a unified approach based on the additive feature attribution method that interprets the difference between an actual prediction and the baseline as the sum of the attribution values, i.e., SHAP values, of each feature. In this study, the SHAP value of each feature in a given sample of the CD dataset was calculated based on our proposed stacking model to present its contribution to the variation of the HAs predictions. For the historical HAs and environmental features, the SHAP value of each variable was taken as the sum of the SHAP values of all its single-day lag and cumulative lag features, representing its contribution over the previous 6 days.
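The lag-aggregation step can be sketched as follows. The SHAP values and feature names are hypothetical, and the grouping assumes a `variable_lagtype` naming convention for the lag features.

```python
import pandas as pd

# Hypothetical per-feature SHAP values for one sample.
shap_vals = pd.Series({
    "pm25_lag1": 0.8, "pm25_lag2": -0.2, "pm25_ma6": 0.5,
    "tem_lag1": -1.1, "tem_std6": 0.3,
    "dow_Mon": 2.0,
})

# Sum the single-day lag and cumulative lag attributions of each base
# variable to obtain its total contribution over the previous 6 days.
base = shap_vals.index.str.split("_").str[0]
per_variable = shap_vals.groupby(base).sum()
```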

A post-hoc interpretation was provided by analyzing the SHAP values from two perspectives. On the global scale, the SHAP values over all training samples were holistically analyzed to reveal how the stacking model fits the relationship between daily HAs and the predictors. On the local scale, the SHAP values of several samples selected from the testing set were investigated to disclose how the environmental features shaped the generated predictions.

Results

Descriptive statistics

The summary statistics of the daily HAs, air pollutants, and meteorological indicators are shown in Table 2, and their temporal variations are visualized in Additional file 1: Figs. S2 and S3, respectively. The correlations between the environmental exposure variables are shown in Additional file 1: Table S3.

Table 2 Descriptive statistics of daily HAs for CD and environmental exposure data in Chengdu, 2015–2018

From 2015 to 2018, the total number of HAs for CD was 102,708, and the average number of daily HAs was 70 (std = 35). The daily mean ± std concentrations of PM2.5, PM10, PMC, O3, SO2, NO2, and CO were 57.9 ± 40.6, 95.7 ± 62.1, 37.8 ± 25.6, 96.6 ± 54.6, 12.7 ± 5.5, 53.9 ± 17.7, and 1030 ± 360 µg/m3, respectively. The value of AQI ranged from 16.7 to 404.6, with a mean of 85.2. The mean TEM and RH were 16.9℃ and 80.5%, respectively.

Model performance

Table 3 compares the performance of the base learners, LSTM, and the proposed stacking model on the CD and stroke datasets.

Table 3 Performance comparison of different methods in predicting HAs for CD and stroke

On both datasets, ANN and LSTM surpass the other individual models in terms of MAE, RMSE, and R2, but yield a higher MAPE than the tree-based models, and the stacking model is substantially superior to all individual models. After applying LDS, the performance of the stacking model is further improved. On the CD dataset, compared with the optimal results obtained by the individual models, the MAE, RMSE, and MAPE of the stacking model with LDS were reduced by 13.9%, 12.7%, and 5.8%, respectively, and the R2 increased by 6.8%. Additionally, the results of the t-test indicate that, under most metrics, the performance gap between the stacking model with LDS and the best individual model is significant; the difference between their R2 values is visualized in Additional file 1: Fig. S4. Figure 3 compares the observed HAs with the predictions of the stacking model with LDS on the two datasets.

Fig. 3
figure 3

The comparison and residuals between the observed HAs and the predictions of the stacking model with LDS on the CD and stroke datasets

Model explanation

Global explanation

Figure 4 shows the distribution of SHAP values of each feature in chronological order, and the features are ranked according to the average of their absolute SHAP values over all the training samples, which represents their global importance.

Fig. 4
figure 4

Heatmap plot of SHAP values of all features across all samples in the CD training set. The width of the black bar on the right-hand side shows the global importance of each feature. a Calendar features and HAs features b Environmental features

Most straightforwardly, a calendar feature almost always had similar SHAP values whenever it took the same value, resulting in visually prominent color blocks with a periodic alternation. In contrast, the impact of the environmental features varied across samples without explicit patterns. In light of additive feature attribution theory, the predicted HAs could be regarded as the additive combination of three parts: a baseline (generally the average of the predictions), a regular difference attributed to the calendar features and historical HAs, and a highly volatile difference attributed to the environmental features. As shown in Table 4, the second part depicted the general trend of variations in the number of daily HAs, and the third part served to further enhance the model performance.

Table 4 The improvement of model performance attributed to environmental features estimated by SHAP

On further examination of each feature shown in Fig. 4, we found that TS and the historical HAs played a major part in profiling the growth trend of HAs, and DOW served to depict the periodic variation of HAs caused by the work-rest schedule in hospitals (see Additional file 1: Fig. S2). TEM in summer and fall contributed to increasing the predicted HAs. Notably, the annual peak concentrations of several air pollutants, such as O3, CO, and PM2.5, and their SHAP values pushing up the predicted HAs occurred at similar times (see Additional file 1: Fig. S3).

Local explanation

As the contributions of calendar features and HAs features are relatively straightforward and regular, it makes more sense to concentrate on how the involvement of environmental features improves the model performance. Thus, the samples on every Wednesday in August 2018 were selected to fix the calendar features except TS, where August was selected to further explore the impact of high temperatures on the risk of CD, and Wednesday was selected to reduce the interference caused by weekend breaks and Chinese holidays. The sum of the first two parts of the predictions mentioned above was set as a new baseline. Figure 5 shows how the SHAP values of environmental features were accumulated from the new baseline to reach the final predictions.

Fig. 5
figure 5

Waterfall plot of SHAP values to four selected samples, i.e., samples on August 7, 14, 21 and 28, 2018. The new baselines and the final predictions are marked at the bottom and top of the image, respectively. The SHAP values of each feature are listed on the bar

By comparing Fig. 5 and Additional file 1: Fig. S5, it was observed that the TEM in August and peak concentrations of PM2.5 and O3 that appeared in the six days leading up to August 28 served to increase the predicted values of HAs, while gradually declining RH over the previous 6 days lowered the predicted values on August 21 by around 1 count.

Discussion

Model performance analysis

The distinct improvement of the stacking model over the base learners can be attributed to three aspects: 1) ANN and the tree-based models performed best under different metrics, which reflects the heterogeneity of their predictions and laid the foundation for the meta learner to combine their strengths and obtain better generalization [36]. 2) The key features helped the meta learner understand how to choose and combine the predictions of the base learners under various circumstances. 3) The re-weighting method based on LDS reduced the error caused by label imbalance, as shown in Fig. 6.

Fig. 6
figure 6

The left side shows empirical label distribution plots, and the right side shows comparison plots of error before and after using LDS on two testing datasets: a CD and b stroke

As shown in Additional file 1: Table S3, there are significant correlations among the environmental exposure variables; however, the performance of the stacking model was not impacted, because the base learners and the meta learner we selected can effectively handle multicollinearity. In Ridge and the elastic net, the L2 regularization term added to the loss function helps stabilize the estimates and reduce overfitting in the presence of collinearity [41].

LSTM has been successfully applied in many fields, but in our task, its prediction performance was still inferior to that of the proposed stacking model. There are two potential reasons: 1) The size of our dataset is limited, with a total of 1461 samples; therefore, using lightweight models with simple architectures and fewer parameters helps avoid overfitting. 2) In our task, the time lag was set to 6 days to consider the short-term effect of environmental exposure, which constrained the advantage of LSTM in learning long-term dependencies. However, the calendar features we extracted, such as YEAR, SEA, and MON, can assist the stacking model in capturing long-term trends.

Comparing SHAP explanation and conventional association analysis

In our study, the explanations obtained by SHAP displayed strong agreement with the conclusions of association analyses using conventional statistical models, such as GLMs and generalized additive models (GAMs). For example, the TEM in summer and the peak values of air pollutant concentrations consistently played a role in describing the increase in HAs. This is in line with previous studies, which found that high levels of air pollution and high temperatures were associated with a high morbidity of stroke [4, 5, 42, 43]. However, we did not observe that cold weather served to drive up the predicted HAs, probably because the minimum TEM (-1.1℃) during our research period in Chengdu did not reach the extreme cold defined in related studies. This consistency indicates that our stacking model can accurately describe the relationship between daily HAs for CD and environmental factors, thereby improving the model performance.

Moreover, if we associate SHAP values that push up the predicted HAs with an increased risk of CD, the model explanations can be regarded as a new perspective for exploring the adverse health effects of air pollutants and extreme weather conditions. For example, the relationship between the risk of CD and the lagged TEM and RH is depicted in Fig. 7. Furthermore, SHAP can comprehensively account for the combined impact of multiple environmental factors, whereas most traditional methods can only analyze the association between a single environmental factor and HAs after controlling for confounding effects among multiple covariates with smoothing functions.

Fig. 7
figure 7

SHAP dependence plots that show the effect of TEM lag5 and RH lag1 on the predictions of HAs

Limitations

Several limitations of our study should be acknowledged. First, limited by the available data sources, we only considered the impact of ambient air pollutants and meteorological conditions on daily HAs for CD, but other environmental factors and individual health behaviors may also play important roles in the development and severity of CD. Second, our proposed model is only applicable to predicting HAs for non-communicable diseases associated with environmental exposure, such as CD, and might not be suitable for forecasting the daily number of hospitalizations for infectious diseases. Third, the peak values of HAs were not well predicted, which could lead to the under-allocation of medical resources.

Conclusions

This study proposed a stacking ensemble model to predict the daily number of HAs for CD using HAs data, air quality data, and meteorological data. The experimental results showed that the proposed model is superior to the base learners and LSTM on two datasets under four evaluation criteria. Moreover, the model explanation demonstrated that environmental factors played a role in further improving the model performance and identified that high TEM and high concentrations of gaseous air pollutants might be strongly associated with an increased risk of CD. This study indicates that the proposed model considering environmental exposure factors is efficient in predicting daily HAs for CD and has practical value for hospital management teams in early warning and healthcare resource allocation.