Algorithm Feature Set
A large set of candidate features for exacerbation identification was used in this study. To prevent model overfitting, the most relevant features were selected by applying a feature selection algorithm; the choice of algorithm was left as a hyperparameter to be optimized during model training. As an example, Table 1 presents the reduced feature set chosen by the logistic feature selection method (implemented by ) for the triage prediction task.
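A logistic feature selection step of this kind can be sketched with scikit-learn, assuming an L1-penalised logistic regression whose zeroed coefficients prune weakly predictive features (the synthetic data, feature counts, and penalty strength below are illustrative, not taken from the study):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the study's feature matrix.
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

# The L1 penalty drives coefficients of weakly predictive features to
# zero; SelectFromModel keeps only features with non-zero coefficients.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0))
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape[1], "features retained of", X.shape[1])
```

Tuning the penalty strength `C` trades off feature-set size against predictive coverage, which is consistent with treating the selection method itself as a hyperparameter.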
Algorithm performance was measured against a 100-case validation set, with ground truth taken as the majority opinion of a panel of 9 specialists (specialist experience and type are indicated in Table 4). For each of the triage, exacerbation, and treatment identification tasks, the top-performing algorithm was a combination of linear discriminant analysis and naive Bayes classifiers joined through a soft voting strategy. The principal performance metrics are defined in Eqs. (1)–(7). The top row of Figure 3 plots the accuracy of the algorithm against that of the individual doctors, alongside the algorithm's performance relative to the average doctor on each of the performance metrics.
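A soft-voting combination of these two classifiers can be sketched as follows, assuming scikit-learn's `VotingClassifier` with averaged class probabilities (the synthetic data is a placeholder for the study's features):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Soft voting averages the predicted class probabilities of the two
# base classifiers and picks the class with the highest mean probability.
ensemble = VotingClassifier(
    estimators=[("lda", LinearDiscriminantAnalysis()),
                ("nb", GaussianNB())],
    voting="soft")

score = cross_val_score(ensemble, X, y, cv=5).mean()
```

Soft voting (probability averaging) lets a confident classifier outweigh an uncertain one, unlike hard voting, which counts each base prediction equally.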
For the triage class, the algorithm agreed with the consensus opinion in 89% of cases whereas the top doctor achieved an 83% accuracy. In exacerbation prediction, the algorithm again outperformed the leading physician with 94% accuracy compared with 91%. When recommending treatment, the algorithm reached a 68% agreement with the majority compared with 64% from the top physician.
The bottom row of Figure 3 shows an analogous plot in which the algorithm did not vote towards the consensus decision; the ground truth was thus taken as the majority of the physician decisions alone (a comparison that inherently disadvantages the algorithm). Even under this test the algorithm performed strongly, placing second in accuracy and scoring higher than the average physician on each of the performance metrics.
Confusion Matrix Analysis
The confusion matrices for the algorithm and the top-performing doctor are displayed in Figure 4. These, together with the metrics listed in Table 2, provide a comprehensive summary of algorithm performance compared with that of the physicians. When assigning triage categories, the algorithm achieved an 89% sensitivity in triaging to the ER and a 99% specificity in assigning a medical attention category; this is significantly better than the average doctor and better than or equal to the top doctor in each category. The PPV achieved was 89%, significantly better than the top doctor's 50%, which resulted from a large over-prediction of ER cases.
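The per-class rates reported here can be derived directly from a confusion matrix. A minimal sketch, using a hypothetical 2×2 matrix whose counts are illustrative only (not taken from the study):

```python
import numpy as np

def binary_rates(cm, positive):
    """Sensitivity, specificity, and PPV for one class of a confusion
    matrix (rows = true labels, columns = predicted labels)."""
    cm = np.asarray(cm)
    tp = cm[positive, positive]            # true positives
    fn = cm[positive].sum() - tp           # missed positives
    fp = cm[:, positive].sum() - tp        # false alarms
    tn = cm.sum() - tp - fn - fp           # everything else
    return tp / (tp + fn), tn / (tn + fp), tp / (tp + fp)

# Hypothetical counts: row/column 0 = ER, row/column 1 = medical attention.
cm = [[25, 3],
      [3, 69]]
sens, spec, ppv = binary_rates(cm, positive=0)
```

A low PPV with high sensitivity, as seen for the top doctor, corresponds to a large `fp` count for the ER column: most true ER cases are caught, but many non-ER cases are also sent there.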
It is worth noting that if we remove the algorithm's "opinion" from the validation set consensus, the algorithm still achieves the top performance in triage identification, with 83% prediction accuracy compared with 77% for the top-performing physician. This indicates that the algorithm maintains top performance even under a test that unfairly advantages the medical specialists.
When reading the confusion matrix for treatment prediction, it is important to remember that class ‘5’ corresponds to cases in which the specialist was not comfortable making a treatment recommendation.
In Table 3, we indicate the 10 most relevant features when predicting the triage and exacerbation categories. In both cases, the features selected by the algorithms are assumed to be the most discriminating when diagnosing HF exacerbation events.
Physician Decision-Making Trends
In Figure 5, we plot the distribution of decisions made by each physician (left charts) alongside the averaged physician distributions, with error bars denoting 1 standard deviation from the mean, for each of the target variables. We see significant variation in opinion between physicians; for example, doctor 4 believes 87.5% of cases warrant medical attention, whereas doctor 8 believes only 34.0% do. We note that only 4.0% of all triage assignments were more than one triage category away from the consensus decision, so decisions are very rarely made far from the average. We also observed wide variation in apparent definitions of exacerbation; for instance, in only 45% of the cases that doctor 2 determines as experiencing an exacerbation do they also assign a medical attention triage category, whereas 100% of the cases doctor 4 determines as exacerbating are also given a medical attention triage label. Finally, 8.5% of cases triaged to a medical attention category were not predicted to be experiencing an exacerbation, suggesting that physicians may have thought an alternate diagnosis was driving symptoms.
Robustness of Validation Set Consensus
In Figure 6, we plot the number of cases whose consensus opinion changes when an additional doctor is added to physician panels of differing sizes. For example, the data at the point "4 to 5" refers to the number of validation-case consensus opinions that change when the panel grows from 4 to 5 physicians. We compute the percentage change for all possible combinations in the data set and plot the maximum, minimum, and mean change with the associated standard deviation. The consensus opinion converges as more doctors are added, with the change being just 5.2% when transitioning from 8 to 9 doctors, 9 being the total number of opinions in the final validation set.
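The panel-growth computation described above can be sketched as follows, assuming binary opinions and simple majority voting with ties broken toward the first-seen value (Python's `statistics.mode` behaviour since 3.8); the random votes are a toy stand-in for the study's data:

```python
from itertools import combinations
from statistics import mode
import random

def consensus(votes, panel):
    """Majority opinion of the given subset of physicians for one case."""
    return mode([votes[d] for d in panel])

def mean_change(all_votes, k):
    """Average fraction of cases whose consensus flips when a panel of
    size k grows to k+1, over all panel / added-doctor combinations."""
    n_docs = len(all_votes[0])
    changes = []
    for panel in combinations(range(n_docs), k):
        for extra in set(range(n_docs)) - set(panel):
            flipped = sum(
                consensus(v, panel) != consensus(v, panel + (extra,))
                for v in all_votes)
            changes.append(flipped / len(all_votes))
    return sum(changes) / len(changes)

# Toy data: 100 cases, 9 doctors, binary opinions (illustrative only).
random.seed(0)
votes = [[random.randint(0, 1) for _ in range(9)] for _ in range(100)]
change_8_to_9 = mean_change(votes, 8)
```

Sweeping `k` from 1 to 8 yields the full convergence curve of Figure 6; the shrinking change at large `k` is what justifies treating the 9-doctor majority as a stable ground truth.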