Introduction

Rising health care costs are a major economic and public health issue worldwide [1, 2]: according to the World Health Organization, health care accounted for 7.9% of Europe’s gross domestic product (GDP) in 2015 [3]. In Switzerland, the health care sector contributes substantially to the national GDP, its share having increased from 10.7% to 12.1% between 2010 and 2015 [3]. Moreover, because health care utilisation costs may serve as a surrogate for an individual’s health status [4], understanding which factors contribute to increases in health expenditures may provide insight into risk factors and potential starting points for preventive measures.

Several studies [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21] have addressed the prediction of health care costs, approaching the issue as either a regression problem or a classification problem (classifying costs into predefined “buckets”). Morid et al. [22] conducted a literature review summarising and comparing the existing models. As far as the annual difference in costs is concerned, we are aware of only 1 study [23], which classified the development of health care costs into only two classes (binary classification). Previous studies also examined a broad variety of features. The most commonly used features include different sets of demographic features, health care utilisation parameters (e.g. hospitalisation or outpatient visits), drug codes, diagnosis codes, procedure codes, various chronic disease scores and cost features.

In this study, we aimed to predict changes in patients’ health care costs in the subsequent year and to identify factors contributing substantially to this prediction. In particular, we focused on the role of pharmacotherapy and other medical features such as hospitalisations and outpatient physician visits. We approached the problem as a binary classification task, predicting whether a patient’s total costs would increase or decrease in 2015, based on their characteristics in 2014. We compared the performance of 3 different models: feedforward neural networks (FNN), boosted decision trees (BDT) and logistic regression (LR). To capture different patterns in the data, we performed extensive feature engineering and introduced new domain-specific features, such as the drug administration mode. Finally, we performed a detailed feature importance analysis and subgroup analysis, based on the decision tree model.

Methods

Study data

We used anonymised claims data provided by the Helsana Group, one of the largest health insurance companies in Switzerland, which covers about 15% of the population across all regions of the country [24]. Basic health insurance coverage is mandatory in Switzerland. All residents are free to choose their preferred insurance providers, which are privately owned. Insurance coverage is financed by a premium and includes co-payments and deductibles [25]. The amount of the deductible can be chosen by the patient and changed every year. All health care invoices submitted for reimbursement are recorded in Helsana’s claims database [24]. The full dataset comprised information on adults (aged ≥18 years) without additional private insurance. All patients were insured by Helsana throughout the study period (2014–2015), ensuring complete records for both years. Furthermore, we required that all patients had at least 5 drug prescriptions in both calendar years and complete records on all demographic variables. In total, 373′264 patients met these requirements. Our dataset comprised demographic parameters, information on health insurance status, prescribed drugs, claimed health care utilisation, and total costs for each patient. Total costs were defined as gross costs for all invoices submitted for reimbursement, thus not taking co-payments and deductibles into account. Prescribed drugs are identified by their Global Trade Item Number (GTIN). Additionally, the active component (5th-level Anatomical Therapeutic Chemical (ATC) code [26]) is available for every drug. Diagnoses are not available in our dataset because of legal regulations in Switzerland.

Introduction of features

Feature engineering plays an important role in most machine learning models and can greatly improve prediction accuracy.

Our exploratory linear regression analysis revealed that, compared with the prediction of total costs, the variance of the difference in costs is harder to explain using basic features such as demographics [5, 6, 13,14,15,16, 19] or simple count measures [17] described in the literature (Additional file 1: Table S1). Therefore, we performed extensive feature generation to include additional predictors in our models. We assigned names to the feature sets, which we later use to discuss their relative importance for the overall accuracy.

Basic features

The included demographic features were age, gender, deductible amount, insurance model and area of residence. We also included the simple count measures of numbers of hospitalisations, outpatient physician office visits, different drugs, and the number of individual prescriptions (GTINs). Because our dataset lacks diagnosis codes, we approximated chronic conditions following the ATC classification proposed by Huber et al. [25] and computed the number of prescribed ATC codes corresponding to each group.
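As an illustration, this chronic-condition approximation amounts to a prefix lookup on ATC codes. The mapping below is a small, hypothetical excerpt for illustration only; the actual classification by Huber et al. [25] is considerably more extensive:

```python
# Illustrative sketch: approximate chronic conditions from prescribed ATC codes.
# The prefix mapping is a hypothetical excerpt, not the full Huber et al. scheme.
CHRONIC_ATC_PREFIXES = {
    "diabetes": ("A10",),                                   # antidiabetic drugs
    "respiratory_illness": ("R03",),                        # obstructive airway diseases
    "cardiovascular_disease": ("C01", "C03", "C07", "C08", "C09"),
}

def chronic_condition_counts(atc_codes):
    """Count, per condition group, how many prescribed ATC codes match a prefix."""
    counts = {condition: 0 for condition in CHRONIC_ATC_PREFIXES}
    for code in atc_codes:
        for condition, prefixes in CHRONIC_ATC_PREFIXES.items():
            if code.startswith(prefixes):
                counts[condition] += 1
    return counts
```

Each patient then contributes one count feature per condition group.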

Features representing pharmacotherapy

In addition to the derived chronic conditions, we included explicit drug information. To reduce sparsity, we chose 4th-level ATC codes [26] (e.g., C10AA, statins) over the 8′705 unique GTINs or the 1′027 5th-level ATC codes. For each of the resulting 449 categories, we computed the number of corresponding prescriptions.

Additional features

We included the following additional features: Hospitalisation was identified using Swiss diagnosis-related group (DRG) codes [27]. We generated features displaying the major diagnostic categories derived from DRG codes (e.g., hospitalisation for diseases of the respiratory system), the type of hospital, and the type of harm (e.g., accident, disease), as well as the overall length of hospital stay. To capture temporal patterns [28], we computed the frequencies of outpatient office and bedside visits per month and per quarter of the year. We also included the physician’s specialisation, the institution dispensing the drug, and the number of visits on weekends (which might indicate acuteness) as features. Additionally, we computed the frequencies of prescriptions for different fine-grained periods of time and the number of prescribed products with certain modes of administration (e.g., intravenous) for each patient. The number of different drug classes and prescriptions (defined as different purchase dates), as well as features representing psychiatric treatment, rehabilitation, nursing home stays, and home care were also included. Finally, we generated a number of descriptive statistics (median, mean, standard deviation, minimum, and maximum) for intervals between, for example, prescriptions, visits and home care to capture a regularity pattern. Our expectation was that the more regular these events are, the more continuous the treatment is, and that irregularity might point to a more acute condition.
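The regularity statistics can be sketched as follows (a minimal Python example; event times are assumed to be given as day offsets within the year):

```python
import statistics

def interval_features(event_days):
    """Descriptive statistics of gaps (in days) between consecutive events
    (e.g. prescriptions or visits), used as a proxy for treatment regularity."""
    days = sorted(event_days)
    gaps = [b - a for a, b in zip(days, days[1:])]
    if len(gaps) < 2:
        return None  # at least 3 events are needed for a spread estimate
    return {
        "median": statistics.median(gaps),
        "mean": statistics.mean(gaps),
        "sd": statistics.stdev(gaps),
        "min": min(gaps),
        "max": max(gaps),
    }
```

Perfectly regular events (e.g. monthly refills) yield a standard deviation of zero, whereas bursts of acute visits produce large spreads.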

Costs feature

Total health care costs in 2014 were included only to assess the overall accuracy and to determine whether the medical features provided complementary information.

Data split

Using random assignment, we divided the dataset into 3 parts: training set (80%), validation set (10%), and test set (10%). The training set was used to develop the prediction models, and the validation set was used for assessing the performance of various methods and for subsequent tuning of the hyperparameters. The test set was reserved for reporting the performance of the final models. We report the basic descriptive statistics in Table 1.
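A minimal sketch of such a random 80/10/10 split (illustrative Python on patient IDs; the actual split was performed on the full claims dataset):

```python
import random

def split_dataset(patient_ids, seed=0):
    """Randomly assign patient IDs to training (80%), validation (10%)
    and test (10%) sets."""
    rng = random.Random(seed)
    ids = list(patient_ids)
    rng.shuffle(ids)
    n_train = int(0.8 * len(ids))
    n_val = int(0.1 * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

# usage on a hypothetical cohort of 1000 patients
train, val, test = split_dataset(range(1000))
```

The three subsets are disjoint and together cover the whole cohort.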

Table 1 Study population characteristics (2014)

Models

We used 3 different methods to develop models for our analysis. As a reference model, we used LR and contrasted its performance to FNN and BDT. All models were developed starting with a set of demographic features. Additional feature sets were added in a stepwise manner, resulting in a total of 747 different features in the complete model (Table 2).

Table 2 Comparison of prediction performance of logistic regression (LR), boosted decision tree (BDT) and feedforward neural network (FNN) using different sets of features

Because we use BDT (in particular the XGBoost [29] library) extensively for the subsequent analyses, a short overview is in order: BDT is a variant of decision tree methods with a gradient boosting algorithm governing the learning process. In decision trees, the input is mapped to a target label by the recursive creation of decision rules [30], which can be represented as nodes in a graphical tree model. The gradient boosting method produces a prediction model in the form of a weighted average of several weak predictors (decision trees).
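The inference step can be illustrated with a toy example: each weak tree routes a sample to a leaf score, and the signed scores are summed. The feature names, thresholds, and leaf values below are hypothetical:

```python
# Toy illustration of BDT inference. Feature names, thresholds, and leaf
# values are hypothetical; a fitted XGBoost model contains many such trees.
TREES = [
    {"feature": "hospital_days", "threshold": 1,
     "yes": {"leaf": 0.4}, "no": {"leaf": -0.6}},
    {"feature": "n_visits", "threshold": 10,
     "yes": {"leaf": -0.2}, "no": {"leaf": 0.3}},
]

def bdt_predict(sample, trees):
    """Walk each tree to a leaf and sum the leaf scores;
    a positive total means a predicted cost increase."""
    total = 0.0
    for tree in trees:
        node = tree
        while "leaf" not in node:  # follow decision rules to a leaf
            branch = "yes" if sample[node["feature"]] < node["threshold"] else "no"
            node = node[branch]
        total += node["leaf"]
    return "increase" if total > 0 else "decrease"
```

This sum-of-scores view is what the weight analysis described below exploits.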

Feature importance analysis using BDT

We used BDT to conduct detailed feature and drug-importance analyses. Using BDT, decision rules can be mapped into respective cuts in our feature space, generating subgroups of patients with a high probability of an increase in costs. In particular, we were interested in medically relevant subgroups, with a particular emphasis on pharmacotherapy.

General feature importance

We used backward deletion to assess general feature importance. Backward deletion begins with all candidate features (here, the complete model), and the deletion of each feature is tested using a chosen model fit criterion. The feature whose deletion degrades the model fit the least is removed. The process is repeated until no further features can be deleted without a large loss in accuracy. This process is displayed in Additional file 1: Figure S1 in the supplement.
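The greedy backward-deletion loop can be sketched as follows, assuming a hypothetical `evaluate` callback that returns the model fit criterion (here, accuracy) for a given feature subset:

```python
def backward_deletion(features, evaluate, max_loss=0.005):
    """Greedy backward elimination: repeatedly drop the feature whose removal
    hurts the fit criterion least, stopping when any further deletion would
    cost more than `max_loss`."""
    current = list(features)
    base = evaluate(current)
    while len(current) > 1:
        candidates = [(evaluate([f for f in current if f != g]), g) for g in current]
        best_acc, worst_feature = max(candidates)
        if base - best_acc > max_loss:
            break  # every remaining feature matters
        current.remove(worst_feature)
        base = best_acc
    return current

# toy fit criterion: only "length_of_stay" matters, plus a tiny per-feature bonus
def toy_accuracy(feats):
    return (0.6 if "length_of_stay" in feats else 0.4) + 0.0001 * len(feats)

kept = backward_deletion(["length_of_stay", "n_visits", "age"], toy_accuracy)
```

In the study, each evaluation corresponds to retraining and scoring the BDT on the validation set, which makes the procedure expensive but model-faithful.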

Drug importance analysis

Conditional drug probabilities

The feature importance analysis based on backward deletion selects features according to their overall contribution to the total accuracy. Because the latter depends on a feature’s frequency in the dataset and its relative discriminative contribution, more frequently prescribed drugs have an advantage over those prescribed less frequently, even if they discriminate less efficiently. To gain additional insight into drug importance, we computed the probability of an increase, conditioned on the drug classes and stratified by hospitalisation.
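Such conditional probabilities reduce to simple empirical frequencies; a minimal sketch on a hypothetical mini-cohort:

```python
def p_increase(patients, drug=None, hospitalised=None):
    """Empirical P(cost increase), optionally conditioned on a drug class
    (ATC code) and/or on hospitalisation status."""
    group = [p for p in patients
             if (drug is None or drug in p["drugs"])
             and (hospitalised is None or p["hospitalised"] == hospitalised)]
    return sum(p["increase"] for p in group) / len(group) if group else None

# hypothetical mini-cohort: prescribed drug classes, hospitalisation flag,
# and observed outcome (1 = costs increased in the following year)
cohort = [
    {"drugs": {"B03BB"}, "hospitalised": False, "increase": 1},
    {"drugs": {"B03BB"}, "hospitalised": False, "increase": 1},
    {"drugs": {"N01AX"}, "hospitalised": True,  "increase": 0},
    {"drugs": set(),     "hospitalised": False, "increase": 0},
]
```

In the study, this computation was performed over all 449 drug classes, stratified by hospitalisation.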

Weight analysis

Although conditional drug probabilities provide an important overview, interactions of the drug classes with other features (except for hospitalisation) could not be assessed. Therefore, we performed a weight analysis to investigate the decision tree model predictions using the test set. To understand the concept of weight analysis, it is important to clarify how the BDT prediction is generated during the inference stage. For a given input sample, the BDT maps every feature in the sample to learned weights or scores. An individual score can be either positive or negative, depending on whether the feature contributes to the prediction of an increase or a decrease, respectively. The final prediction is an increase if the sum of all scores is positive; otherwise it is a decrease. Thus, by analysing the weights of particular features over a sample of inputs, one can understand how often and how strongly these features contribute [31]. Using this intuition, we filtered out the drug classes that contributed a high proportion of the score for increases or decreases (at least 5% of the overall positive or negative score).
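The final filtering step can be sketched as follows (illustrative Python; the per-feature scores would come from the fitted BDT, and the names and values below are hypothetical):

```python
def dominant_features(feature_scores, threshold=0.05):
    """Flag features whose summed score is at least `threshold` of the total
    positive (increase) or total negative (decrease) score mass."""
    pos_total = sum(s for s in feature_scores.values() if s > 0)
    neg_total = sum(-s for s in feature_scores.values() if s < 0)
    increase = sorted(f for f, s in feature_scores.items()
                      if s > 0 and pos_total and s / pos_total >= threshold)
    decrease = sorted(f for f, s in feature_scores.items()
                      if s < 0 and neg_total and -s / neg_total >= threshold)
    return increase, decrease

# hypothetical aggregated scores over the test set
scores = {"A12CC (magnesium)": 0.5, "age": 0.02,
          "C10AA (statins)": -0.3, "noise": -0.001}
inc, dec = dominant_features(scores)
```

Note that a drug class can in principle appear in both lists, since its score varies across samples; the aggregation shown here is a simplification.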

Subgroup analysis

BDT produces a prediction model in the form of a weighted average of several weak predictors. To find examples of highly predictive subgroups involving drug classes, we employed the following strategy: First, we filtered out all decision paths in all trees in which a particular drug class was used. More precisely, we considered only the paths to which the prescription of the drug contributed. Next, we measured the conditional probability of increase for the cuts given by the filtered paths. We denote this probability by P(increase | cut). For every such cut, we computed the conditional probability without the drug class cut, P(increase | cut without drug class). We defined the gain as the difference |P(increase | cut) - P(increase | cut without drug class)|. Lastly, we chose the subgroups with high values of gain.
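A minimal sketch of this gain computation, using predicate-based cuts on a hypothetical mini-cohort (the actual analysis operated on the decision paths of the fitted trees):

```python
def subgroup_gain(patients, cut, drug_rule):
    """Gain = |P(increase | cut incl. drug rule) - P(increase | cut alone)|.
    A cut is a list of predicates over a patient record."""
    def p_inc(rules):
        group = [p for p in patients if all(r(p) for r in rules)]
        return sum(p["increase"] for p in group) / len(group) if group else 0.0
    return abs(p_inc(cut + [drug_rule]) - p_inc(cut))

# hypothetical mini-cohort and cut; B03BB = folic acid
cohort = [
    {"visits": 2, "drugs": {"B03BB"}, "increase": 1},
    {"visits": 2, "drugs": {"B03BB"}, "increase": 1},
    {"visits": 2, "drugs": set(),     "increase": 0},
    {"visits": 2, "drugs": set(),     "increase": 0},
    {"visits": 9, "drugs": {"B03BB"}, "increase": 0},
]
few_visits = lambda p: p["visits"] < 5
folic_acid = lambda p: "B03BB" in p["drugs"]
```

Here the drug rule lifts P(increase) within the cut from 0.5 to 1.0, a gain of 0.5.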

Results

In Table 1 we show the basic descriptive statistics for the total dataset, as well as for the three subsets. As one can see from the table, the training, validation and test datasets follow the same distribution over all parameters. In particular, it is important that the variation of the annual cost difference and the proportion of cost increase/decrease is small (within ±0.6% for the cost increase).

Performance of models

The BDT model performed the best, leading to 67.6% accuracy and an area under the curve (AUC) score of 0.74, indicating good discrimination between the classes. The receiver operating characteristic curves of all 3 models are presented in Fig. 1. Table 2 indicates the performance of the models on different sets of features. Whereas demographic features alone were not predictive at all, adding simple count measures — especially the number of outpatient office visits and the number of hospitalisations — substantially improved prediction accuracy. The effects of additional features (n = 264), total costs, and pharmacotherapy (n = 449) were about the same (2–3%), depending on the chosen model. Once combined, the overall accuracy further improved by more than 1%, indicating that these features contain complementary information. As for the model comparison, FNN and BDT consistently outperformed LR by about 2%. Moreover, BDT generalised better on the unseen samples, outperforming the FNN in accuracy by about 0.4%.

Fig. 1
figure 1

Area under the receiver operating characteristic curve (AUC): Comparison of prediction performance. LR = logistic regression, BDT = boosted decision tree, FNN = feedforward neural network

General feature importance

Gradually adding feature sets already provides some intuition about their relative importance, but decision tree models can be further utilised for the systematic analysis of feature importance. Using backward deletion, we found that the number of features could be reduced to 36, with only a 0.5% loss in accuracy (Table 2, Additional file 1: Figure S1). We identified the length of hospital stay, total costs, and the intravenous mode of drug administration as the most important features. The full list of 36 features is presented in Additional file 1: Table S2. The list comprises both demographic and various medical features such as the number of individual prescriptions, the temporal pattern of outpatient visits, and diabetes as a chronic condition. Interestingly, the following 6 drug classes remained in the model: A03BA (belladonna alkaloids), B03BB (folic acid), N01AH (opioid anaesthetics), N01AX (other general anaesthetics), S01BC (ophthalmologic non-steroidal anti-inflammatory agents) and S01CA (ophthalmologic corticosteroids and anti-infectives in combination).

Drug importance analysis

Conditional drug probabilities

For the total study population, irrespective of prescribed drugs, the probability of cost increase was 51.9%. Conditioned on hospitalisation, the probabilities for increase were 23.1 and 58.1% with and without hospitalisation in 2014, respectively. We subsequently computed the probabilities of increase or decrease in costs conditioned on the 449 drug classes and on hospitalisation. The results are presented in Table 3. In particular, we present the drug classes with the highest probabilities for cost increase or decrease and with frequent prescriptions. All 6 drug classes identified in the previous section are included in this table, with only folic acid (B03BB) being an indicator for an increase in costs.

Table 3 Probabilities of cost increase and decrease for patient groups, conditioned on drug groups and hospitalisation

Weight analysis

Through the weight analysis, we identified additional drug classes that contributed to the accuracy of prediction. Many of them were found to contribute to predictions of both increases and decreases (Table 4). For instance, magnesium is among the drug groups with a high accuracy for increases (71.4%), but it is also an important feature for decreases among patients without hospitalisation (78.6%).

Table 4 Weight analysis: Contribution of drug classes to the prediction

Subgroup analysis

We present examples of the subgroup analysis in Table 5. We found small (100–600 patients) but highly predictive subgroups for cost increases (as high as 88%). Moreover, the gain attributable to the drug class was high, reaching up to 23% for folic acid (Example #1) and 21% for oral iron supplements (Example #3). In addition to drug classes, subgroups were further characterised by a variety of features, including outpatient visits, drug prescription information (both counts and temporal information), information on the deductible, home care, and hospitalisation. Example #7 represents a rather large subgroup of patients without hospitalisation with a high fraction of cost decrease (0.74; gain 18%).

Table 5 Examples of subgroups derived from the decision tree

Discussion

Our models classify patients according to their probability of an increase in costs, with a few features in particular contributing substantially to the prediction. Pharmacotherapy provides important information for the prediction of cost increases, and its relative importance grows in interaction with other features, including health care utilisation. We identified patient subgroups with very high probabilities of increase and decrease.

Performance of models

Our models predict whether patients’ total health care costs will increase in the subsequent year, with an accuracy of up to 67.6% (AUC 0.74). Lahiri et al. [23] reported a higher accuracy (77.6%) when investigating increases in inpatient claims costs using Medicare data. Although this study is the closest in terms of setting to our study, some major differences should be emphasised: First, Lahiri et al. predicted inpatient expenditures using both inpatient and outpatient information, whereas we consider the change in total health care costs using only outpatient claims and whether or not a patient was hospitalised. Moreover, they found diagnoses and features indicating the development of a new chronic condition the most important features. Diagnoses are not available in our dataset because of legal regulations in Switzerland, and the derivation of features indicating the development of a new chronic condition requires information from the year for which predictions are made. Because these data are typically not available in a prospective scenario, our study was designed so that all the features could be generated without any information from the subsequent year. We found that, for the prediction of a costs increase, medical and costs features contained complementary information. Additionally, the inclusion of medical features facilitates the identification of potential targets for preventive measures [32, 33].

General feature importance

In general, we found that high health care utilisation in the first year was an indicator for a decrease in the following year. Using backward deletion, we identified the 36 most important features, including, for example, length of hospital stay, home care, and count measures for outpatient visits and drug prescriptions. Simple count measures accurately capture the intensity of health care utilisation and therefore may reflect the severity of the disease state [17]. Additionally, when they are generated for multiple timeframes, these measures can be used to introduce valuable temporal information, as highlighted in a recent study by Morid et al. [28]. Interestingly, the counts of drug prescriptions and outpatient visits in the last quarter and the last month of the year are among the most important features, which suggests that the model captures the likelihood of therapy continuing into the next year. Intravenously administered drugs are typically associated with severe conditions, explaining why the intravenous mode of administration was an important feature in our study. Likewise, Pritchard et al. [1] reported that physician-administered injectable or infusible treatments account for a comparably higher fraction of expenditures in high-resource patients. We identified diabetes as an important chronic condition for the prediction of a cost increase, which is consistent with diagnoses identified as important in other studies [23]. In general, chronic conditions [2, 34] and multimorbidity [35] are well-described risk factors for high health care utilisation.

Drug importance analysis

We found that high probabilities of increase are mainly associated with drug groups used to treat chronic conditions that have a higher likelihood of worsening over time (e.g., anticholinesterases and dopa derivatives for treating dementia or Parkinson’s disease). In contrast, drug groups associated with a higher probability of decrease are predominantly used for severe acute conditions requiring extensive treatment (e.g., adrenergic and dopaminergic agents) or are proxies for expensive procedures, such as (local) anaesthetics used in day surgery. Evaluating the contribution of drug classes to the prediction using a weight analysis, we found that many drug groups contribute to the prediction of both increases and decreases. This finding indicates that the contribution of pharmacotherapy depends on other features and can vary greatly across subgroups.

Subgroup analysis

When evaluating several example drug groups in more detail, their contribution becomes even clearer. We identified subgroups with a high probability of increase (up to 88%). Although there may be even more, we can derive at least 3 higher-level groups from our examples: 1.) potentially pregnant patients who have not yet delivered; 2.) healthy patients; and 3.) patients suffering from chronic conditions with low use of health care resources. Pregnancy without delivery is considered an important condition for predicting future resource use [36] and is therefore included as a feature in some diagnosis-based comorbidity scores. Lacking diagnosis codes, our model identifies combinations of ATC codes (e.g., folic acid, magnesium), outpatient specialist visits for gynaecology, and few outpatient visits at the beginning of the year as patterns indicating potential pregnancy. For a subgroup of patients hospitalised for delivery, the model predicted a decrease in costs, with as much as 92% accuracy. The “healthy patients” group was characterised by few prescriptions (including at least 1 prescription for oral iron supplements) and a high deductible that did not change in the next year, indicating a self-assessment of very good health status. Self-reported general health has been found to be an important indicator of future health care utilisation in previous studies [18, 37]. Claims data do not include information on self-reported health, so changes in the deductible may serve as an indicator of patients’ individual expectations regarding upcoming health expenditures. Tamang et al. [21] found that patients with a large increase in costs were younger and less likely to have hospitalisation costs and chronic conditions, compared with persistently high-cost patients, which is consistent with our subgroup findings.
The final group represents elderly patients suffering from chronic or worsening conditions, with low use of health care resources, yet with a higher likelihood of an increase in resource use in the following year. Subgroups of patients with a high probability of a cost decrease were characterised by chronic conditions with intensive health-related claims (hospitalisation, home care), or by expensive diagnostic procedures or day surgery.

Limitations

Change in health care costs is a very broad outcome, and our data represent a whole population, without restrictions on underlying diseases or demographic groups. We therefore found multiple reasons for increases and decreases in costs, many of which are not predictable or preventable (e.g., accidents). Diagnoses might have provided additional patient information, but they are not available in Swiss claims data. Expensive claims such as hospitalisation in the first year may mask less expensive changes such as new drug prescriptions or additional physician visits in the following year, making the development of costs unsuitable for the evaluation of causal drug-related risk factors. Model-wise, the main limitation was the sparsity in representing the prescriptions. We think that learning distributed embeddings via techniques similar to skip-gram [38] might mitigate this problem. Moreover, applying recurrent neural networks to learn representations of medical codes and patients is an active research area [39,40,41,42,43]. In this context, the findings of our study can provide a good starting point for interpreting the results of such advanced models.

Outlook

This research focused on cost increases at the population level, covering two subsequent years. Future research should cover multiple subsequent years. In a recent Danish study, Tamang et al. [21] reported that, over the course of eight years, the majority of high-cost patients showed only one high-cost year. Among those with multiple high-cost years, many did not experience them consecutively. In light of the high fluctuation of individual annual costs, evaluating an increase in costs over a longer study period may provide insight into long-term effects.

Our project was designed to evaluate the risk factors for cost increase in the total population. While this approach allows for a broad investigation, it naturally reduces the impact of rare drug classes on the overall accuracy. However, such drug classes, including chemotherapeutics or biologicals, would be of special interest because of their contribution to the overall cost increase in health care. To evaluate the impact of rare but high-cost treatments in more detail, future studies will have to focus on specific subgroups. This approach would reduce sparsity in the data and would allow the use of individual substances instead of drug classes. Additionally, temporal information on treatment induction, duration, and intensity should be included in future analyses.

Our results provide subgroups with a high probability of cost increase. This information can help decision makers optimise health care services for these subgroups through improved resource allocation planning. For instance, we identified a subgroup of healthy patients who are likely to develop a cost increase. This group may be further investigated with respect to the causes, magnitude, and preventability of the cost increase. For patients suffering from chronic conditions with low use of health care resources, preventive measures such as disease management programs could be established. Additionally, patients may better choose their deductibles for the next year based on the predicted development of their costs.

Conclusion

The development of costs can be predicted using binary classification. Our results indicate that the contribution of pharmacotherapy depends strongly on other features and can vary across subgroups. Therefore, further studies may focus on the development of models for predefined and therefore less heterogeneous subgroups. A detailed understanding of such subgroups may help to identify potential starting points for improving patient management.