Background

Public and private regulators use risk adjustment models to prevent adverse selection, anticipate budgetary reserve needs, and offer care management services to high-risk individuals [1]. Preventing risk selection by insurers is a critical ethical, legal, and societal goal that risk adjustment models can address. Risk adjustment models attempt to capture the relationship between demographic and clinical variables (risk adjusters) and subsequent healthcare utilization or spending. The models are commonly derived through standard linear regression methods or their extensions, and rely on individual-level data commonly captured in administrative claims datasets [2]. All of the available models on the current commercial market are linear or log-linear regression models that leverage the same basic elements such as age, sex, diagnostic and procedure codes [3].

Risk adjustment modeling may be improved by both methodological and conceptual advances in the risk modeling and healthcare services literature. From a methodological standpoint, newer machine learning methods have recently emerged as alternatives or complements to linear regression for predicting highly variable health outcomes using large sparse datasets, including estimating healthcare costs using claims data [4, 5]. While traditional risk adjustment models are limited in modeling complexity and tend to underpredict expenditures of populations with very high expenditures [6, 7], machine learning methods may help to capture complex non-linear relationships and interaction terms among variables, which could explain why some individuals with complex constellations of risk factors and diagnoses experience substantially higher costs than predicted. For example, among people with low income and diabetes receiving insulin, food insecurity is associated with hypoglycemia and emergency room visits during the last week of each month (after income from a first-of-the-month paycheck is depleted and hypoglycemic medications are still being taken) [8]. These complex relationships are hard to model in standard risk equations, but can potentially be better captured by interaction-focused, nonlinear machine learning algorithms. Despite this promise, machine learning techniques have not yet been widely adopted for risk adjustment. This is partially because the machine learning models developed to date have not yet demonstrated superior predictive performance over traditional linear models on large datasets with more than a million enrollees [2].

From a conceptual standpoint, risk adjustment may also be improved by including additional area-level indicators of social determinants of health (SDH), such as poverty, unemployment, and education, which contribute to risk, utilization and cost [9,10,11]. Since before the UK Black Report and the Health Divide, epidemiologists have shown that while cultural and individual behavioral choices influence health, living conditions including the availability of resources (e.g., clean air and water), working conditions, and quality of food and housing have a particularly profound association with health outcomes [12]. More recent initiatives to directly address these ‘social determinants’ of health include strategies to refer patients with food insecurity to food pantries, those who are homeless to direct housing resources, and those with transportation challenges to assisted transport services, as a means to improve clinical outcomes such as nutrition-related chronic disease metrics (e.g., nutrition affecting blood pressure and diabetes glycemic control), improve access to healthcare visits, and reduce stress-related adverse health outcomes [13].

The inclusion of SDH indicators into risk adjustment may particularly help plan payment estimation. SDH indicators may help capture previously unmeasured factors that could influence the course of disease, such as how poverty may affect chronic disease outcomes by affecting the ability to pay for medications or more nutritious foods, or how unemployment relates to mental health and associated course of disease related to depression and lower adherence [14, 15]. Individual-level SDH factors are rarely assessed or included in commonly-available data, but area-level SDH indicators are readily assessed by national data sources [16], and may be linked to the 5-digit ZIP code often available in claims data. Area-level SDH indicators were recently incorporated into risk adjustment models for the Massachusetts Medicaid program; their inclusion improved concurrent annual healthcare spending predictions for low-income adults [17]. It remains unclear, however, to what extent incorporating area-level SDH indicators could improve prospective annual healthcare spending predictions, particularly for the privately-insured population who constitute the largest share of insured people in the US, but for whom SDH factors may be less visible or influential than for the Medicaid population.

The objective of this study was to assess whether prospective risk adjustment models may be improved by machine learning methods and by the incorporation of area-level SDH indicators in a national privately-insured adult population.

Method

Data

Our primary data were healthcare claims from a single large national commercial insurer operating in all 50 US states, Washington D.C., and Puerto Rico (Fig. 1). From the claims data, we included privately-insured individuals 18 through 64 years old who had at least 24 months of continuous enrollment. Individuals who switched plans were implicitly excluded by the continuous enrollment criterion, but individuals who moved were included. We used demographics (age, sex) and diagnostic codes (Clinical Classification Software categories [18]) as candidate risk adjustment variables at the individual level, and SDH indicators at the 5-digit ZIP code-level from the American Community Survey (ACS) by the U.S. Census Bureau (Table 1) [16]. SDH indicators were selected to reflect current conceptual theories concerning a broad range of social, economic, and health system factors that may influence health risk, utilization, or cost (Supplementary Information Table S1) [19, 20]. The resulting dataset contained claims data from 1.18 million unique members, which we randomly partitioned into training data (1.06 million members) for model derivation and test data (0.12 million members) for model performance assessment [21]. There was no overlap of individuals between the two data subsets. Demographic statistics of the data subsets by geographic location are reported in Supplementary Information Table S2.

Fig. 1

Data Set Selection Flow Diagram. Administrative claims data were obtained from a single national private insurer. The dotted arrow indicates that predictors were optionally incorporated but no members were added or excluded.

Table 1 Statistics of ZIP Code-Level Social Determinants of Health

Outcome

We sought to prospectively predict 2017 individual-level total annual healthcare spending from 2016 data. As a secondary objective, we also considered concurrent risk adjustment, predicting 2016 member-level annual spending from 2016 data, as is done, for instance, in the Affordable Care Act health insurance exchanges (see Supplementary Information). We estimated total annual spending by summing standardized costs in U.S. Dollars over 12 months, including post-year claims corrections, and including zero spending among enrolled individuals without medical claims. To inhibit outliers from affecting model fit, costs in the training set were top-coded at $400,000 (cost larger than $400,000 was replaced with $400,000), which corresponded to the top 0.1% cost of members in the training set. Top-coding is performed to reduce model sensitivity to skewness and kurtosis, and has been preferred over dropping members with high cost since these cases can be indicative of specific conditions which are associated with high cost [2, 22, 23].
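The top-coding step described above can be illustrated with a minimal Python sketch. The function names and the example cost values are hypothetical and are not taken from the study code; only the $400,000 cap comes from the text.

```python
def top_code(costs, cap=400_000.0):
    """Replace every cost above `cap` with `cap`; leave other values unchanged."""
    return [min(c, cap) for c in costs]


def share_of_dollars_removed(costs, cap=400_000.0):
    """Fraction of total spending eliminated by top-coding at `cap`."""
    total = sum(costs)
    if total == 0:
        return 0.0
    return (total - sum(top_code(costs, cap))) / total


# Hypothetical annual costs, including one enrollee with zero spending.
example_costs = [0.0, 855.0, 3_847.0, 1_000_000.0]
capped = top_code(example_costs)
removed = share_of_dollars_removed(example_costs)
```

Because members above the cap are kept rather than dropped, the model still sees their diagnoses; only the magnitude of their cost is limited.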

Model development

A 2-by-2 factorial design was employed to compare modeling approaches (linear regression versus the machine learning approach of gradient boosted decision trees), and variable choice (demographics and diagnostic codes alone versus additional area-level SDH indicators). In each of the methods, individual-level predictors with their associated area-level predictors are input to the model together as if they were all individual-level properties.
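As an illustration of how area-level predictors can be supplied alongside individual-level predictors, the following Python sketch joins ZIP code-level values onto each member record. The field names (`dx_counts`, `median_household_income`, `pct_unemployed`), the example values, and the use of `None` for unmatched ZIP codes are assumptions made for this example and do not describe the study's actual data layout.

```python
def attach_area_features(members, sdh_by_zip, sdh_names):
    """Build one feature row per member by appending ZIP-level SDH values.

    Every member in a ZIP code receives the same SDH values, so the model
    treats them as if they were individual-level properties. Members whose
    ZIP code has no SDH record get None (an assumption of this sketch).
    """
    rows = []
    for m in members:
        area = sdh_by_zip.get(m["zip"], {})
        row = {"age": m["age"], "female": m["female"]}
        row.update(m["dx_counts"])          # individual-level diagnosis counts
        for name in sdh_names:
            row[name] = area.get(name)      # area-level value copied to the member
        rows.append(row)
    return rows


# Hypothetical members and ZIP-level indicators.
members = [
    {"zip": "94305", "age": 41, "female": 1, "dx_counts": {"ccs_158": 2}},
    {"zip": "00000", "age": 30, "female": 0, "dx_counts": {"ccs_158": 0}},
]
sdh_by_zip = {"94305": {"median_household_income": 100000.0, "pct_unemployed": 3.5}}
sdh_names = ["median_household_income", "pct_unemployed"]
rows = attach_area_features(members, sdh_by_zip, sdh_names)
```

The models without SDH indicators correspond to omitting the `sdh_names` columns from the same rows.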

Linear regression approach

A linear model derived through ordinary least squares regression was trained to predict 2017 spending based on 2016 member characteristics. We additionally developed penalized linear regression models using methods that may better address collinearity (Least Absolute Shrinkage and Selection Operator [LASSO]), as detailed in the Supplementary Information. LASSO regression tends to sparsely select among collinear variables by forcing coefficients to zero for all but one of the collinear variables [24, 25].

Machine learning approach

The machine learning approach investigated in this study was gradient boosted decision trees [26]. This approach involves the construction of an ensemble of decision trees, where each tree learns from the errors of the prior tree (a “boosting” approach) to iteratively improve predictions [27]. With each iteration, a new tree is constructed by sampling from the data and identifying which variable most effectively divides the members into groups with low within-group variation in cost and high between-group variation in cost. This variable selection process is repeated to further divide each resulting subset of the data, producing a series of branches in the decision tree. The tree is added to the current ensemble, and then the next tree is fit using the same process on the residuals of the ensemble.
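The residual-fitting loop described above can be sketched in plain Python. This is a simplified illustration under several assumptions, not the study's implementation: each tree is a single-split "stump", the loss is squared error, no data subsampling is performed, and the names `fit_boosted_stumps`, `n_trees`, and `learning_rate` are chosen for this example. The LightGBM framework used in the study includes additional optimizations that are not shown.

```python
def fit_stump(X, residuals):
    """Find the (feature, threshold) split that minimizes within-group squared error."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # every feature is constant: predict the mean residual
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, t, left_mean, right_mean = stump
    return left_mean if row[j] <= t else right_mean


def fit_boosted_stumps(X, y, n_trees=50, learning_rate=0.1):
    """Start from the mean cost, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, stumps, learning_rate


def boosted_predict(model, row):
    base, stumps, lr = model
    return base + lr * sum(stump_predict(s, row) for s in stumps)


# Hypothetical data: [age, diagnosis count] and annual cost.
X = [[25, 0], [40, 1], [60, 0], [62, 3]]
y = [500.0, 2000.0, 3000.0, 20000.0]
model = fit_boosted_stumps(X, y)
```

Each stump splits on the variable that leaves the least variation in residual cost within the two resulting groups, which mirrors the split criterion described in the text.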

We chose gradient boosted decision trees over alternative machine learning methods because the approach has been shown to handle mixes of categorical and continuous covariates, capture nonlinear relationships, and scale well to large amounts of data [28]. Moreover, it is straightforward to obtain variable importance rankings from the model, which may permit the approach to be more interpretable than many other machine learning methods, for which acceptability in a healthcare services context may critically depend on visualizing “black box” predictions [29]. We used the LightGBM framework to develop the models, which implements several algorithmic optimizations on standard gradient boosting to allow for additional training efficiency [30]. A detailed treatment of gradient boosted decision trees and LightGBM is provided in the Supplementary Information. We used 3-fold cross validation on the training data subset to select the parameters for the model, including the number of trees, the maximum depth of each tree, and the minimum level of loss reduction necessary to partition leaf nodes, based on which achieved the lowest mean squared error averaged across the 3 folds [21]. We then refitted the model to the full training set using the best parameters determined from 3-fold cross validation, which can further help reduce overfitting. We additionally developed random forest and shallow multilayer perceptron models using a similar training procedure, as detailed in the Supplementary Information [31, 32].
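The selection procedure described above (score each candidate setting by mean squared error averaged over 3 folds, then refit on the full training set with the best setting) can be sketched as follows. To keep the example dependency-free, a one-feature ridge regression with a single penalty parameter `lam` stands in for the gradient boosted model and its parameters; this substitution, the helper names, and the toy data are assumptions of the sketch.

```python
import random


def k_fold_indices(n, k=3, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_ridge_1d(x, y, lam):
    """One-feature ridge regression with an unpenalized intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / (sxx + lam) if (sxx + lam) > 0 else 0.0
    return my - slope * mx, slope


def mse(model, x, y):
    b0, b1 = model
    return sum((b0 + b1 * a - b) ** 2 for a, b in zip(x, y)) / len(y)


def select_by_cv(x, y, grid, k=3, seed=0):
    """Return (best setting, mean CV error per setting, model refit on all data)."""
    folds = k_fold_indices(len(y), k, seed)
    scores = {}
    for lam in grid:
        fold_errors = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(y)) if i not in held]
            model = fit_ridge_1d([x[i] for i in train], [y[i] for i in train], lam)
            fold_errors.append(
                mse(model, [x[i] for i in held_out], [y[i] for i in held_out]))
        scores[lam] = sum(fold_errors) / k
    best = min(scores, key=scores.get)
    return best, scores, fit_ridge_1d(x, y, best)  # refit on the full training set


# Hypothetical, exactly linear data, so the unpenalized setting should win.
x = [float(i) for i in range(12)]
y = [2.0 * a + 1.0 for a in x]
best_lam, cv_scores, final_model = select_by_cv(x, y, grid=[0.0, 10.0, 1000.0])
```

The held-out test set plays no part in this selection; it is reserved for the performance assessment.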

Model testing and statistical analysis

We evaluated the performance of the prospective risk adjustment models on the test set. The performance metrics are detailed below.

Goodness of fit

We evaluated model goodness of fit using the coefficient of determination (R2) and the mean absolute error (MAE). We estimated the R2 with confidence intervals using the nonparametric bootstrap with 5000 bootstrap replicates [33], and the MAE with confidence intervals using a paired t-test.
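A minimal sketch of these goodness-of-fit measures is given below. The percentile form of the bootstrap interval is an assumption (the text states only that a nonparametric bootstrap with 5000 replicates was used), and the function names are illustrative; the t-test-based interval for the MAE is not shown.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample members with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
        except ZeroDivisionError:
            continue  # skip degenerate resamples with no variation in cost
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper
```

Observed and predicted costs are resampled together by member, so each replicate preserves the pairing between a member's prediction and outcome.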

Discrimination

We assessed discrimination using the concordance-statistic (C-statistic), a rank correlation metric for assessing the model’s ability to order members by their spending [34]. The C-statistic estimates the probability that, for a randomly selected pair of members, the member with the higher cost will be correctly predicted as having higher cost by the model [35]. The C-statistic is the generalization of the area under the receiver operating characteristic curve from the binary to the continuous outcome setting, where a result between 0.7 and 0.8 is considered acceptable, between 0.8 and 0.9 is considered good, and above 0.9 is considered excellent [36]. We estimated confidence intervals for the C-statistic using a jack-knife procedure [37].
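The pairwise definition of the C-statistic for a continuous outcome can be written directly, as in the sketch below. The handling of ties (pairs with equal observed cost are skipped, and pairs with equal predictions count as one half) is an assumption of this example, and the quadratic-time loop is intended only for illustration; a test set of this size would call for a faster algorithm.

```python
def c_statistic(y_true, y_pred):
    """Share of member pairs whose predicted costs are ordered like their observed costs."""
    concordant = 0.0
    usable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # tied observed costs carry no ordering information
            usable += 1
            d_true = y_true[i] - y_true[j]
            d_pred = y_pred[i] - y_pred[j]
            if d_pred == 0:
                concordant += 0.5          # tied predictions count as half
            elif (d_true > 0) == (d_pred > 0):
                concordant += 1.0
    if usable == 0:
        raise ValueError("no pairs with distinct observed cost")
    return concordant / usable
```

A model that ranks every pair correctly scores 1.0, and a model that assigns every member the same prediction scores 0.5.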

Subgroup analyses

Risk adjustment models often underpredict spending for specific subgroups of enrollees leading to underpayment to the insurer, and there is evidence that insurers explicitly make health plans less desirable for enrollees in undercompensated groups [38, 39]. To evaluate the performance of the models on vulnerable subgroups, we defined test data subgroups using age, sex, and area-level SDH indicators. SDH indicator subgroups included individuals living in ZIP codes in the lowest decile of household income; lowest decile of education level (by high school diploma and by bachelor’s degree receipt); highest decile of Gini index for inequality; low ratio of income to poverty level; high proportion of households receiving food stamps; high proportion with single parents; high unemployment; high uninsurance rate; and high proportion reporting they do not speak English “very well” (see Supplementary Information Table S1 for decile thresholds).

The performance of the model on each subgroup was measured using the predictive ratio [40] and net compensation [39, 41]. The predictive ratio for a subgroup was computed as the ratio of the mean of observed spending to the mean of predicted spending over the subgroup, where a value above 1 indicates underestimation of cost and a value below 1 indicates overestimation [17]. We estimated 95% confidence intervals around the predictive ratio using the delta method [42]. Net compensation was used as a measure on the dollar scale, and was computed as the mean difference between predicted spending and observed spending over the subgroup, where a value below 0 indicates underestimation of cost and a value over 0 indicates overestimation. We estimated 95% confidence intervals around the net compensation values using a paired t-test.
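Both subgroup measures follow directly from their definitions above, as in this minimal sketch. The function names, the boolean subgroup mask, and the example dollar amounts are hypothetical; the confidence interval calculations are not shown.

```python
def predictive_ratio(observed, predicted):
    """Mean observed / mean predicted spending; above 1 indicates underestimation."""
    return (sum(observed) / len(observed)) / (sum(predicted) / len(predicted))


def net_compensation(observed, predicted):
    """Mean of (predicted - observed) spending; below 0 indicates underestimation."""
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)


def subgroup_metrics(observed, predicted, in_subgroup):
    """Compute both measures over the members flagged True in `in_subgroup`."""
    obs = [o for o, keep in zip(observed, in_subgroup) if keep]
    pred = [p for p, keep in zip(predicted, in_subgroup) if keep]
    return predictive_ratio(obs, pred), net_compensation(obs, pred)


# Hypothetical test-set members; the first three belong to the subgroup.
observed = [1000.0, 3000.0, 8000.0, 500.0]
predicted = [1500.0, 2500.0, 6000.0, 700.0]
ratio, net = subgroup_metrics(observed, predicted, [True, True, True, False])
```

In this example the subgroup's mean observed spending ($4000) exceeds its mean predicted spending, so the predictive ratio is above 1 and the net compensation is negative, both indicating underestimation.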

Analyses were approved by the Stanford Institutional Review Board (eProtocol #42334), and performed in Python version 3.6.6 [43] and R version 3.5.0 [44], using the code shared online for reproducibility at: https://github.com/stanfordmlgroup/risk-adjustment-ml.

Results

Descriptive statistics on the data subsets are detailed in Table 2. The test set had a mean age of 41.1 years (median 41.0; IQR 30.0, 53.0) and was 48.9% female. Top-coding cost at $400,000 eliminated approximately 2.8% of dollars and test set members had a mean top-coded annual healthcare cost of $6677 (median 855; IQR 161, 3847). Around 17.7% of members in the test set had zero annual healthcare cost.

Table 2 Characteristics of Members in the Dataset Subsets

Table 3 shows the test set performance of the prospective linear and machine learning models without and with the SDH indicators.

Table 3 Performance Measures of the Prospective Linear and Machine Learning Models on the Test Set

Linear regression without SDH indicators

The linear regression model without SDH indicators, when derived through ordinary least squares regression, had the largest standardized coefficients (indicating highest importance among covariates in the model) for age and sex indicators and diagnostic coding for birth complications and chronic kidney disease (see Supplementary Information Table S3). The model had an R2 of 0.327 (95% CI 0.300, 0.353), MAE of $6992 (95% CI 6889, 7094), and C-statistic of 0.703 (95% CI 0.701, 0.705). Linear models derived through LASSO had similar performance metrics but tended to favor diagnoses more than traditional least squares (see Supplementary Information Tables S3 and S5).

Linear regression with SDH indicators

The inclusion of SDH indicators in the linear regression model had no substantial effect on the overall performance metrics. The model had an R2 of 0.327 (95% CI 0.300, 0.354), MAE of $6991 (95% CI 6889, 7094), and C-statistic of 0.700 (95% CI 0.699, 0.702).

Machine learning without SDH indicators

Switching from a linear regression model to the machine learning model significantly improved the coefficient of determination, significantly reduced error, and significantly improved discrimination. Specifically, the machine learning model without SDH indicators had an R2 of 0.388 (95% CI 0.357, 0.420), MAE of $6637 (95% CI 6539, 6735), and C-statistic of 0.717 (95% CI 0.715, 0.718). The multilayer perceptron and random forest models outperformed the linear models but performed worse than the LightGBM model across all metrics (Supplementary Information Table S5).
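As a worked illustration of the size of this error reduction, the MAE reported here can be compared with the MAE of $6992 reported for the linear regression model without SDH indicators and scaled to a group of 10,000 members. The scaling to 10,000 members is an illustrative choice, and the calculation uses only the point estimates and ignores their confidence intervals.

```latex
\[
(\$6992 - \$6637) \times 10{,}000
  = \$355 \times 10{,}000
  = \$3{,}550{,}000
  \approx \$3.55\ \text{million}
\]
```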

Machine learning with SDH indicators

The inclusion of SDH indicators in the machine learning model also had no substantial effect on the overall performance metrics relative to the machine learning model without SDH indicators. The model had an R2 of 0.387 (95% CI 0.357, 0.419), MAE of $6634 (95% CI 6536, 6732), and C-statistic of 0.716 (95% CI 0.714, 0.717). We created variable importance rankings to assist in the interpretation of the machine learning model. Diagnosis predictors had the largest importance metrics in the machine learning model, with the most important predictors being chronic kidney disease, deficiency and other anemia, and other aftercare (see Supplementary Information Table S4).

Subgroup analyses

Table 4 compares the predictive ratios and net compensation values for the machine learning model without and with SDH indicators. The addition of SDH indicators resolved or reduced underestimation of risk on all of the SDH-based subgroups, but the 95% confidence intervals of the non-SDH and SDH-including models overlapped in all subgroups. On one of the high-poverty subgroups, the subgroup with a high proportion of non-fluent English speakers, the subgroup with a high prevalence of uninsured, and the subgroup of individuals who lived in areas with a large proportion of households on food stamps, the incorporation of SDH indicators resolved the underestimation of risk. Among subgroups of individuals who lived in areas with high poverty, high wealth inequality, and high prevalence of uninsured, the machine learning model trained with SDH indicators substantially reduced underestimation of cost, improving the predictive ratio by 3% (and net compensation by $200 per person) over the model trained without SDH indicators. The addition of SDH indicators led to small additional overpayment on the 4 subgroups for which the model without SDH indicators did not substantially underestimate risk (predictive ratio < 1.01), specifically one of the high-poverty subgroups, the subgroup with a large unemployed population, the subgroup with a low percentage of high school graduates, and the subgroup with a large number of single-parent families. Additional subgroup analyses among all models are presented in Supplementary Information Tables S6, S7, and S8.

Table 4 Predictive Ratio and Net Compensation Values of Prospective Machine Learning Models on SDH-Based Subgroups in the Test Set

Additional results

Binned scatter plots of the prospective risk adjustment models on the test set are shown in Fig. S1. We additionally explored the effect of using binary diagnosis predictors instead of counts (Supplementary Information Table S9), the effect of top-coding cost (Supplementary Information Table S10), the effect of including lab results (Supplementary Information Table S11), and the development of concurrent risk adjustment models (Supplementary Information Table S12).

Discussion

We observed that switching from a linear regression model to a gradient boosting ML model significantly improved determination and discrimination and reduced absolute error in cost. We also observed that the inclusion of SDH indicators at the ZIP code-level reduced underestimation of cost among people living in vulnerable areas.

Prior studies have separately investigated whether machine learning and the incorporation of SDH indicators can improve risk adjustment. The use of machine learning for prospective risk prediction in a previous study did not demonstrate substantial improvements over linear regression for a privately-insured population [4]. However, the addition of SDH indicators has been shown to improve concurrent risk adjustment models, including Medicare Advantage Plan quality rankings, Medicare’s Hospital Readmissions Reduction Program penalties, and concurrent annual healthcare spending among a state Medicaid population [17, 45, 46]. In our study, the incorporation of SDH indicators reduced cost underestimation in several vulnerable subgroups, even among a commercially-insured population. Improving predictions of cost within these subgroups is important in order to address persistent inequalities that lead to bias in the estimation of payment [47,48,49].

Our study has important limitations. First, the risk models developed here are unlikely to generalize well to populations outside the U.S. or to Medicaid or Medicare populations, for whom risk adjustment models may be particularly consequential to avoid adverse selection and maintain competitive and fair markets. However, the methods employed in this study could be used in developing specific models for those populations. Second, similar to other machine learning methods, the modeling approach used in this study is more complex than traditional linear regression. Although this may confer an advantage by preventing ‘cheating’, in that machine learning models may be less susceptible to up-coding behaviors intended to inflate risk estimates [2], the complexity might also make it difficult to understand how and why the model made a certain decision [29]. Third, since risk adjustment models are developed on historical data, they tend to perpetuate inequality of past spending trends if no explicit adjustments are made to account for the endogeneity of spending. Prior work has investigated methods to develop fairer healthcare payment models through data manipulation and modeling changes [39, 41, 50], which can be pursued in future studies. Fourth, the SDH indicators used in this study are at the area level, which may lead to bias or ecological fallacy in the risk adjustment models. However, combining the claims data used in this work with individual-level socioeconomic status variables was prohibited for privacy reasons. Fifth, 5-digit ZIP codes are not as homogeneous as Census Tracts or Census Block Groups, which have been used in previous linear regression models assessing SDH-associated effects for Medicaid and Medicare populations [51]. The risk for this study is a potential underestimation of the contribution of SDH to risk models. However, ZIP code is more readily available in commercial claims datasets. Sixth, there remains debate about whether adding SDH indicators may allow poorer healthcare to persist in healthcare organizations serving predominantly lower-income populations, by compensating them more in value-based payment models that adjust not only for outcomes but also, for instance, for lower income, although recent studies suggest this will not necessarily mask hospital quality [52]. Seventh, one key challenge is to predict per-member utilization rather than cost. However, given that cost is a key concern for payers and is often disproportionate to utilization due to negotiated contracts and geographic variations in cost, we modeled overall costs to help understand how much machine learning and geographic parameters such as social determinants could capture the complexities related to payment.

In the future, our ML approach may be improved upon in several ways. It may be possible to take advantage of the temporality of the data, for example by including more than one year of medical history. Additionally, it may be possible to train a hybrid (concurrent and prospective) model to leverage the continuous nature of medical enrollment, utilization, and claims [53]. Finally, using highly parameterized models such as deep neural networks could better capture nonlinear interactions between covariates and scale to large claims datasets, at the expense of interpretability [54]. We have shared our code in an open source manner to enable others to reproduce and extend our methods to other datasets.

Conclusion

The results of the current study suggest that machine learning methods and the inclusion of area-level SDH indicators may improve prospective risk adjustment models in a commercially insured population. The SDH indicators were particularly useful for populations living in vulnerable areas, while the machine learning approach had a greater impact on overall performance, leading to improvements in fit, discrimination, and overall cost allocation (>$3 M reduction in error per 10,000 people).