In this section, we evaluate our methods on predicting the recovery of patients with traumatic brain injury. Details about the dataset have already been presented in Sect. 3.1. Here, we describe our evaluation framework.
Evaluation settings and framework
We have presented two methods: EvolutionPred, which projects a patient into the future given the current state, and EvoLabelPred, which predicts the recovery of a patient given the current state and the current label, e.g. at \(t_{\rm pre}\).
Framework for EvolutionPred
To evaluate the performance of the projections from EvolutionPred, we take inspiration from the mean absolute scaled error (MASE) [25], which was originally designed to alleviate the scaling effects of the mean absolute error (MAE). To define our variation of MASE, we assume an arbitrary set of moments \({\mathcal {T}}=\{t_1,t_2,\ldots ,t_n\}\). For an individual \(x\), we define the MASE of the last instantiation \(x_n\) as
$$MASE(x)=\frac{d(x_{\rm proj},x_n)}{\frac{1}{n-1}\sum_{i=2}^{n}d(x_i,x_{i-1})},$$
where \(d(\cdot)\) is a function computing the distance between two instantiations of the same individual \(x\). This measure normalizes the error of EvolutionPred at the last moment \(t_n\) (numerator) by the error of a naive method (denominator), which predicts that each instantiation of \(x\) is the same as the previous (truly observed) one. If the average distance between consecutive instantiations is smaller than the distance between the last instantiation and its projection, then MASE is larger than 1. Smaller values are better.
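For concreteness, a minimal sketch of this computation, assuming the instantiations of an individual are the rows of a matrix and that \(d\) is the Euclidean distance (the definition above does not fix a particular \(d\)):

```python
import numpy as np

def mase(trajectory, projection):
    """trajectory: (n, n_features) observed instantiations x_1..x_n;
    projection: (n_features,) projected last instantiation x_proj."""
    # Error of EvolutionPred at the last moment t_n (numerator).
    err = np.linalg.norm(projection - trajectory[-1])
    # Average distance between consecutive instantiations: the error
    # of the naive forecast "x_i stays as x_{i-1}" (denominator).
    naive = np.mean(np.linalg.norm(np.diff(trajectory, axis=0), axis=1))
    return err / naive
```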
We further compute \(Hits()\), i.e. the number of times the correct cluster is predicted for a patient \(x\). Assume that instantiation \(x_{\rm pre}\) belongs to cluster \(c_{\rm pre}\), and let \(c_{\rm proj}\) denote \({first_{match}}(c_{\rm pre})\) (cf. Eq. 1) at the projection moment \(t_{\rm proj}\). We set \(Hits(x)=1\) if \(c_{\rm proj}\) is the same as \(c_{\rm post}\) (i.e. the cluster closest to \(x_{\rm post}\)), and \(Hits(x)=0\) otherwise. Higher values are better.
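A hedged sketch of the Hits computation; \({first_{match}}\) itself is defined by Eq. 1 (not reproduced here), so its output is taken as given, and the clusters at the projection moment are assumed to be represented by centroids:

```python
import numpy as np

def hits(c_proj, x_post, centroids_proj):
    """c_proj: index of first_match(c_pre) at t_proj (cf. Eq. 1);
    x_post: true post-treatment instantiation of the patient;
    centroids_proj: cluster centroids at the projection moment."""
    # c_post is the cluster whose centroid is closest to x_post.
    c_post = int(np.argmin(
        [np.linalg.norm(x_post - c) for c in centroids_proj]))
    return int(c_proj == c_post)
```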
For model purity, we compute the entropy of a cluster \(c\) towards a set of classes \(\xi\): the entropy is minimal if all members of \(c\) belong to the same class, and maximal if the members are equally distributed among the classes. We aggregate the per-cluster values into an entropy value for the whole set of clusters \(\zeta\), \(entropy(\zeta,\xi)\).
In general, lower entropy values are better. However, the labels used by EvolutionPred are Control and Patient: if a clustering cannot separate patient instantiations from controls well, this means that the patient instantiations (which are the result of the projection done by EvolutionPred) have become very similar to the controls. Hence, in this setting, high entropy values are better.
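A minimal sketch of this purity measure; the aggregation over clusters is assumed here to be weighted by cluster size, which the text does not state explicitly:

```python
import numpy as np

def cluster_entropy(member_labels, classes):
    """Entropy of one cluster towards the class set (0 = pure)."""
    p = np.array([np.mean(member_labels == c) for c in classes])
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def clustering_entropy(assignments, class_labels, classes):
    """Size-weighted aggregate entropy(zeta, xi) over all clusters."""
    return sum(
        np.mean(assignments == k) *
        cluster_entropy(class_labels[assignments == k], classes)
        for k in np.unique(assignments))
```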
For learning the evolutionary prediction model, we use bootstrap sampling [26] with a sample size of 85 % and 10,000 replications. Model validation is done on the out-of-sample data. For clustering the union of projected instances and the controls, we use K-Means; here, we use bootstrap sampling with a sample size of 75 % and 1,000 replications, and vary \(K=2,\ldots,8\).
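The clustering evaluation could then be organized as follows; a sketch under assumptions (scikit-learn's KMeans, the clustering_entropy helper from the sketch above, and sampling without replacement, as stated for the label prediction model below):

```python
import numpy as np
from sklearn.cluster import KMeans

def entropy_vs_k(X, labels, classes, k_range=range(2, 9),
                 frac=0.75, reps=1000, seed=0):
    """Cluster projected patients together with the controls and
    average the entropy per K over all replications."""
    rng = np.random.default_rng(seed)
    out = {k: [] for k in k_range}
    for _ in range(reps):
        # 75 % sample of the union of projected patients and controls.
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        for k in k_range:
            assign = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
            out[k].append(clustering_entropy(assign, labels[idx], classes))
    return {k: float(np.mean(v)) for k, v in out.items()}
```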
Framework for EvoLabelPred
To evaluate EvoLabelPred, we use accuracy to assess the quality of the computed labels against the ground truth established in Sect. 3.2. Additionally, we vary the number of subgroups, i.e. \(K=3, 4\).
To learn an evolutionary label prediction model, we use bootstrap sampling [26] with a sample size of 85 % and 5,000 replications. Sampling is done without replacement, i.e. duplicates are not allowed. The model is validated on the objects outside of the sample.
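A sketch of this validation protocol, with fit_fn and predict_fn as hypothetical stand-ins for the model learning and label prediction steps:

```python
import numpy as np

def out_of_sample_accuracy(X, y, fit_fn, predict_fn,
                           frac=0.85, reps=5000, seed=0):
    """Learn on an 85 % subsample drawn without replacement and
    measure label-prediction accuracy on the held-out objects."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(reps):
        mask = np.zeros(len(X), dtype=bool)
        mask[rng.choice(len(X), size=int(frac * len(X)),
                        replace=False)] = True
        model = fit_fn(X[mask], y[mask])                 # in-sample fit
        accs.append(np.mean(predict_fn(model, X[~mask]) == y[~mask]))
    return float(np.mean(accs))
```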
Evaluation results
Evaluating evolutionary projection
Validation of the projection from \(t_{\rm pre}\) to \(t_{\rm post}\): In the first experiment, we project the patient instantiations from \(t_{\rm pre}\) to \(t_{\rm post}\). Since the true instantiations at \(t_{\rm post}\) are known, we use these projections to validate EvolutionPred, evaluating with the MASE and Hits measures (cf. Sect. 4.1). Figure 3 depicts the hard and soft projections of the pre-treatment patient instantiations, while Table 3 reports the MASE and Hits values for each patient separately. We perform 10,000 runs and average the values over the runs.
In Fig. 3, we can see that the hard projection (yellow) and soft projection (green) behave very similarly. Both predict the patient instantiations at \(t_{\rm post }\) very well: the mean values for the projected patient instantiations are almost identical to the true instantiations, and the shaded regions (capturing the variance around the mean) overlap with the variance of the true values almost completely.
Table 3 Hard and soft projection of patients from \(t_{\rm pre}\) towards \(t_{\rm post}\), with MASE and Hits per patient: low MASE is better and values larger than 1 are poor; high Hits are better, with 1.0 being best; averages over all patients exclude outlier patient #14
The first row of Table 3 enumerates the 15 patients in the TBI dataset, and the subsequent rows show the MASE values for the hard and the soft projection, respectively. The last row shows the Hits value per patient. The last column averages the MASE and Hits values over all but one patient: patient #14 is excluded from the computation, because prior inspection revealed that this patient is an outlier for whom only few assessments are available. All other patients exhibit low MASE values (lower is better), indicating that our projection mechanisms predict the patient assessments at \(t_{\rm post}\) well.
Projection from \(t_{\rm post}\) to the future \(t_{\rm proj}\): In the second experiment, EvolutionPred projects the patients after treatment start towards a future moment \(t_{\rm proj}\), which corresponds to an ideal final set of assessments that the patient might ultimately reach through continuation of the treatment. We do not have a ground truth to evaluate the quality of these projections. Rather, we use a juxtaposition of patients and controls, as depicted in Fig. 4. We show the average values per population as a solid line, surrounded by a region capturing the variance of the values for each variable. The cyan line and surrounding cyan-shaded region stand for the moment \(t_{\rm pre}\), denoted as “Pre” in the legend; the blue line and region stand for the moment \(t_{\rm post}\) (“Post”), while the “Controls” are marked by the red line and red-shaded region. Except for Gender and Age, for which controls have been intentionally chosen to be similar to the patients, the patients differ from the controls. Even where the red area overlaps with the cyan (Pre) or the blue (Post) area of the patients, as for the assessments CIM and CV, the average values are different.
Figure 5 shows the same lines and areas for assessments before and after treatment start (Pre: cyan, Post: blue) as the reference Fig. 4, but adds the projected assessment values (Proj: green/yellow). These projected assessments are closer to the controls, indicating that, at least for some of the assessments (FAS1, ICP, CIM, CV, MT, VP), treatment continuation may lead asymptotically to values similar to those of the controls.
Clustering patients with controls: We investigate whether the patients can be separated from the control population through clustering. We skip the assessments TMT-B, BTA, WCST-NC and WCST-RP, which have been recorded only for some patients. We cluster the controls with the patient instantiations before treatment (Pre: red line), after treatment start (Post: yellow line), with the hard projected instantiations (green line) and with the soft projection (blue dashed line). We use bootstrapping with a sample size of 75 % and 1,000 replications. In Fig. 6, we show the entropy while varying the number of clusters \(K\). Higher values are better, because they mean that the clustering cannot separate controls from patients. High values are achieved only for the projected instantiations.
In Fig. 6, the entropy values are very high for the clusters containing controls together with projected patients, whereby soft projection and hard projection behave identically. The high values mean that the clustering algorithm cannot separate projected patients from controls by similarity: the instances are too similar. This should be contrasted with the clusters containing controls and patients before treatment (red line): the entropy is low and drops fast as the number of clusters increases, indicating that patients before treatment are similar to each other and dissimilar to the controls. After the treatment starts, separating patients from controls by similarity (yellow line) is less easy, but an increase in the number of clusters leads to a fair separation. In contrast, projected patients remain similar to controls even when the number of clusters increases: even the small clusters still contain both controls and patients.
Evaluating evolutionary label prediction
Table 4 Label prediction accuracies of each patient for EvoLabelPred with GroundTruth based on ICP attribute
We present the results of the label prediction experiments on the TBI dataset in Table 4. In this experiment, we first learned the evolutionary model using EvoLabelPred with \(K=3, 4\), and then utilized the conditional-probabilities-based label prediction (cf. Sect. 3.3.4) within each individual cluster to predict the labels of the out-of-sample patients. The accuracy of label prediction for the label learned from the ICP variable is very low: the method achieves a very high accuracy for some of the patients, but fails completely for others.
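The conditional-probabilities step can be sketched as follows; this is one interpretation of Sect. 3.3.4, not its verbatim implementation: within each cluster, \(P(label \mid cluster)\) is estimated from the in-sample patients, and an out-of-sample patient receives the most probable label of its nearest cluster.

```python
import numpy as np

def label_priors(assignments, labels):
    """P(label | cluster), estimated from the in-sample patients."""
    return {k: {lab: float(np.mean(labels[assignments == k] == lab))
                for lab in np.unique(labels)}
            for k in np.unique(assignments)}

def predict_label(x, centroids, priors):
    """Assign x to its nearest cluster, return its most probable label."""
    k = int(np.argmin([np.linalg.norm(x - c) for c in centroids]))
    return max(priors[k], key=priors[k].get)
```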
Table 5 Meta Information on the clustering model from Fig. 7
To reflect on the low accuracies of the label prediction, we show the clusters from \(t_{\rm pre}\) and \(t_{\rm post}\) in Fig. 7, after removing the outliers. The membership information is given in Table 5. We can observe how patients move closer to the controls (depicted as a dashed blue line) from \(t_{\rm pre}\) to \(t_{\rm post}\). The clusters take into account the changes in the similarity among patients, but this does not lead to meaningful predictions. Upon inspecting the dataset, we discovered that the ICP variable is not correlated with the other attributes in the TBI dataset. This is to be expected, since the selected cognitive tests are not correlated with each other. From the above experiments, we can clearly see that it is not possible to predict the ICP values from the values of the other cognitive tests.
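This inspection can be reproduced with a short check; the file name and the column name "ICP" below are hypothetical placeholders for the dataset at hand:

```python
import pandas as pd

# Hypothetical load; one column per assessment, placeholder file name.
tbi = pd.read_csv("tbi_assessments.csv")
corr_with_icp = tbi.corr(numeric_only=True)["ICP"].drop("ICP")
print(corr_with_icp.sort_values())  # near zero throughout for ICP
```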
Table 6 Label prediction accuracies of each patient for EvoLabelPred with GroundTruth based on the ICP variable; PCA was applied to the TBI dataset prior to the learning of the evolutionary model
We conducted further experiments to test this non-correlation among the variables. We applied PCA to the TBI dataset prior to model learning. We present the results in Table 6 for the EvoLabelPred model based on \(K=3\) clusters and conditional-probabilities-based label prediction. Although we see a slight improvement compared to the results without PCA (cf. Table 4), the overall performance is low. After removing the outliers from the label prediction model, the performance of our label prediction even dropped considerably. This means that the ICP variable does not predict well whether the patient has recovered or not, contrary to expectations.
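A sketch of the PCA preprocessing, assuming scikit-learn; the number of retained components is an assumption, as the text does not state a cutoff:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: patient-by-assessment matrix of the TBI dataset (assumed given).
# Standardize the assessments, then project onto principal components
# retaining 95 % of the variance (assumed cutoff, not from the text).
X_std = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=0.95).fit_transform(X_std)
# X_pca then replaces X in the EvoLabelPred learning step with K = 3.
```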