INTRODUCTION

The distribution of healthcare spending is skewed: a majority of expenditure is incurred by a small proportion of patients.1,2 In part, this is due to appropriate use of resources, but unfortunately, many high-cost patients receive unnecessary or ineffective care.3 Therefore, patients who remain high cost over multiple years, often characterized by complex medical and social needs, have become a target for intervention.4,5 Care-management programmes for this so-called high-need high-cost (HNHC) population aim to improve quality of care and enhance cost-effectiveness.5,6,7 Programmes often encompass structured clinical follow-up supervised by an interdisciplinary care team, with or without self-management support, pharmaceutical care and patient and caregiver education.7,8 Programme results vary, but a positive impact has been shown on quality of care, healthcare use and, subsequently, cost.9,10,11

Selecting appropriate patients is essential for the success of such programmes.6,8 Yet, HNHC patients comprise a very heterogeneous population, hindering a clear definition of who is HNHC (i.e. case definition) and strategies for patient selection (i.e. case finding). Case definition has been based on prior healthcare utilization (e.g. prior hospitalizations), healthcare expenditure (e.g. top cost decile in the previous year) or clinical profile (e.g. comorbidity score).3,13,14,15,16 One method for case finding is through quantitative prediction models.12,13,14 Predicting future healthcare utilization and spending has been important in actuarial science for decades. Consequently, proprietary claims-based risk assessment tools have been developed.15,16 In recent years, the purpose of these tools has shifted from risk adjustment to case finding for care-management programmes in an attempt to improve quality of care.15,16 However, proprietary tools are typically sold commercially and therefore not freely available to those developing a care-management programme.

This systematic review identified and assessed current studies on non-proprietary models predicting future HNHC healthcare use. We aimed to evaluate model performance, appraise risk of bias and assess applicability as part of a case-finding strategy for HNHC care-management programmes.

METHODS

This review was reported according to PRISMA guidelines and prospectively registered17,18 (PROSPERO CRD42020164734). Five databases were searched through January 31, 2021, for development and validation studies of prognostic prediction models with an explicitly described HNHC outcome. Two reviewers (UR, RK) independently performed study selection. We excluded studies not reporting on original research, such as reviews, meta-analyses or methodological studies. Further exclusion criteria were as follows: published in a language other than English, population including minors, population defined by a single diagnosis (i.e. one specific morbidity as a condition), no definition of high-need high-cost healthcare use provided, or a definition without any measure of cost. The eMethods in the Supplement provide a full description of included databases, study selection, data extraction and data synthesis. Authors of 21 potentially eligible studies were contacted to collect additional data or to answer methodological questions (eTable 3 in the Supplement).19 In five cases, the additional information showed that the study met one or more exclusion criteria, leading to exclusion from this review.

Predictors were described based on the type of data source they were derived from (e.g. claims data, survey data) and categorized according to Andersen’s Behavioural Model of Healthcare Utilization.20 This model describes healthcare use as a function of ‘Predisposing’, ‘Enabling’ and ‘Need’ characteristics. First, ‘Predisposing’ characteristics predispose people to healthcare use while not being directly linked to such use and are divided into three categories: demographic (e.g. age, sex), social structure (e.g. education, occupation) and beliefs (e.g. values concerning health and illness).20 Second, ‘Enabling’ characteristics facilitate or inhibit use of services and are divided into two categories: family (e.g. income, insurance) and community (e.g. ratio of health personnel to population, urban-rural character).20 Finally, ‘Need’ characteristics capture whether illness requires care, as assessed by the patient or healthcare provider, and are divided into the perceived level of illness and the (clinically) evaluated level of illness.20 The eMethods provide a detailed description of Andersen’s model, its three categories of predictors and their subcategories.
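To make this categorization concrete, the following is a minimal sketch in Python of how candidate predictors might be mapped onto Andersen's categories and subcategories; the predictor names and assignments are hypothetical illustrations, not the coding scheme used in this review.

```python
# Illustrative mapping of candidate predictors onto Andersen's Behavioural
# Model. Predictor names and category assignments are hypothetical examples,
# not the actual coding scheme used in this review.
ANDERSEN_TAXONOMY = {
    "predisposing": {
        "demographic": ["age", "sex"],
        "social_structure": ["education", "occupation"],
        "beliefs": ["health_values"],  # absent from all identified models
    },
    "enabling": {
        "family": ["income", "insurance_status"],
        "community": ["clinician_density", "urban_rural"],
    },
    "need": {
        "perceived": ["self_rated_health"],
        "evaluated": ["comorbidity_score", "prior_hospitalizations"],
    },
}

def classify(predictor: str) -> tuple[str, str] | None:
    """Return (category, subcategory) for a predictor, or None if unmapped."""
    for category, subcategories in ANDERSEN_TAXONOMY.items():
        for subcategory, predictors in subcategories.items():
            if predictor in predictors:
                return category, subcategory
    return None

print(classify("income"))  # ('enabling', 'family')
```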

Model performance was evaluated using the reported measures of discrimination (i.e. the ability to distinguish between those with and those without the outcome), calibration (i.e. the agreement between observed and predicted outcomes), classification (i.e. sensitivity and specificity) and clinical usefulness (i.e. the ability to make better decisions with a model than without).21,22 Applicability for clinical use was assessed for all validated models by plotting model performance, indicated by discrimination (expressed as the C-statistic), against the expected performance in new patients, indicated by the risk of overfitting to the development data (expressed as the natural log of the events per variable, EPV). A C-statistic ≥ 0.7 implies good discrimination.23,24,25 An EPV ≥ 20 (natural log = 3) is considered to imply minimal risk of overfitting.26,27
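To illustrate the two axes of this assessment, the following is a minimal sketch in Python, assuming scikit-learn is available; the outcome labels, predicted risks and model size are hypothetical.

```python
# Sketch: computing the two quantities used in the applicability assessment.
# Assumes a binary HNHC outcome (1 = HNHC) and predicted risks from any model;
# data and variable names are hypothetical.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])   # observed HNHC status
y_pred = np.array([0.1, 0.2, 0.15, 0.8, 0.3, 0.7, 0.25, 0.1, 0.6, 0.4])

# Discrimination: the C-statistic equals the area under the ROC curve.
c_statistic = roc_auc_score(y_true, y_pred)

# Risk of overfitting: events per candidate predictor variable (EPV).
n_events = int(y_true.sum())
n_candidate_predictors = 2          # hypothetical model size
log_epv = np.log(n_events / n_candidate_predictors)

print(f"C-statistic {c_statistic:.2f} (>= 0.7 suggests good discrimination)")
print(f"ln(EPV) {log_epv:.2f} (>= 3.0, i.e. EPV >= 20, suggests minimal overfitting)")
```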

Risk of bias and concerns regarding the applicability of a primary study to the review question were assessed independently by two reviewers (UR, RK) using the Prediction model Risk Of Bias Assessment Tool (PROBAST).28 Outcomes were summarized as ‘high’, ‘unclear’ and ‘low’ risk or concern.28

RESULTS

Of 5890 unique studies reviewed, 530 were selected for full-text review and 60 met all criteria (eFigure 1). These 60 studies provide information on the development and evaluation of 313 unique models predicting future HNHC healthcare use (Table 1). Fifteen studies (25%) developed a model without further validation; 25 studies (42%) developed a model and conducted internal validation; 15 studies (25%) developed a model and conducted external validation and five studies (8%) validated an existing model. All were cohort studies, of which 20 (33%) were prospective. In development cohorts, the population size ranged from 136 to 10,300,856. Most studies (n = 42; 70%) were performed in the USA, of which 11 (18%) were based on data from the Centers for Medicare and Medicaid Services (CMS): five on Medicare data, four on Medicaid data and two on a combination. Other studies originated from Canada (n = 8; 13%), Spain (n = 2; 3%), Denmark (n = 1; 2%), Japan (n = 1; 2%), the Netherlands (n = 1; 2%), Singapore (n = 1; 2%), South Korea (n = 1; 2%), Switzerland (n = 1; 2%), Taiwan (n = 1; 2%) and the UK (n = 1; 2%). Most studies used regression analysis (n = 47; 78%), and a substantial number employed artificial intelligence, either in addition to or instead of regression (n = 14; 23%). For each study, a full description of the study population, predicted outcome, sample size, predictors, prediction timespan and performance measures is available in eTable 1 in the Supplement.

Table 1 Summary of Development, Validation and Extension Studies of Identified Models Predicting High-Need High-Cost Healthcare Use

Risk Predictors

Predictors were generally derived from a combination of data sources, with claims data (n = 36; 60%) and survey data (n = 29; 48%) used most often. Classification of predictors according to Andersen’s model showed that all studies used ‘Predisposing’ predictors. In this category, demographics (e.g. age, sex) were the most common, while beliefs were included in none of the studies. ‘Enabling’ predictors (e.g. income, access to a regular source of care) were used in 23 studies (38%), and ‘Need’ predictors were included in 53 studies (88%). Perceived need was predominantly represented in studies based on survey data. Finally, predictors based on prior cost or healthcare utilization were included in 42 studies (70%).

Predicted Outcome and Associated Timespan

Most studies (n = 36; 60%) estimated the risk for patients to become part of some top percentage of the cost distribution (e.g. the top decile) within a mean time horizon of 16 months (range 12–60). Other outcomes were measures of healthcare utilization (e.g. top decile of care visits; n = 23; 38%) and alternative measures of expenditure (e.g. total healthcare cost; n = 12; 20%). Most studies (n = 51; 85%) had a prediction timespan of 12 months: they made predictions at baseline for the following 12 months. Four studies (7%) had a shorter prediction timespan, while five studies (8%) extended their timespan beyond 12 months with a focus on predicting persistence in HNHC healthcare use. Two of these five studies concerned the same model.29,30
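As an illustration of the most common outcome definition, the following is a minimal sketch in Python, using pandas and hypothetical per-patient cost data, of how membership of the top cost decile in the subsequent year can be operationalized as a binary label.

```python
# Sketch: deriving the most common outcome in the identified studies, i.e.
# membership of the top decile of next year's cost distribution.
# The DataFrame and column names are hypothetical.
import pandas as pd

costs = pd.DataFrame({
    "patient_id": range(1, 11),
    "cost_next_year": [120, 450, 90, 15000, 800, 60, 2300, 310, 41000, 500],
})

threshold = costs["cost_next_year"].quantile(0.90)  # 90th percentile cut-off
costs["hnhc_top_decile"] = (costs["cost_next_year"] >= threshold).astype(int)

print(costs[["patient_id", "hnhc_top_decile"]])
```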

Model Performance

Model validation was provided in 45 studies (75%): 25 studies (42%) conducted internal and 20 studies (33%) conducted external validation. Internal validation was most often performed by means of a split-sample analysis (n = 17; 28%), in which a model’s predictive performance is evaluated on a random part of the study sample after being developed on the complementary part (Table 2). External validation was most often performed by means of temporal validation (n = 15; 25%), in which the validation population is sampled from a different time period than the development cohort. Some measure of model performance was reported in nearly all studies (n = 57; 95%). Discrimination was reported in 38 studies (63%), typically using the C-statistic (n = 36; 95%). Calibration was reported in 19 studies (32%), most often by means of a goodness-of-fit test (Hosmer-Lemeshow) (n = 7; 37%). Fourteen studies (23%) reported performance measures on both discrimination and calibration, 24 studies (40%) reported on discrimination alone, five studies (8%) on calibration alone and two (3%) on clinical usefulness.
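The following is a minimal sketch in Python of split-sample internal validation as described above, using scikit-learn and simulated data; note that bootstrapping or cross-validation generally makes more efficient use of the available sample than a single random split.

```python
# Sketch: split-sample internal validation, the most common approach in the
# identified studies. Data are simulated; predictor and outcome definitions
# are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 5))                      # hypothetical predictors
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))     # hypothetical HNHC outcome

# Develop the model on one random part of the sample ...
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_dev, y_dev)

# ... and evaluate predictive performance on the complementary part.
c_statistic = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"Split-sample validated C-statistic: {c_statistic:.2f}")
```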

Table 2 Summary of Model Characteristics

Comparison with Proprietary Models

Eight studies (13%) directly compared their developed models with a proprietary model. One of these studies developed a model with a validated C-statistic of 0.84, while the best-performing model in that study, which included proprietary predictors, yielded a C-statistic of 0.86.31 Another study, which compared several proprietary and non-proprietary prediction models, showed C-statistics ranging from 0.71 to 0.76 for models estimating the risk for patients to become part of the top decile of the cost distribution in the subsequent year.32

Risk of Bias Assessment

Risk of bias was rated as ‘high’ for 40 studies (67%), ‘unclear’ for 13 (22%) and ‘low’ for the remaining seven (12%) studies (Fig. 1). Within PROBAST subdomains, most studies (n = 37; 62%) scored ‘high’ in the ‘Analysis’ subdomain. However, risk of bias was ‘low’ for the subdomain ‘Participants’ in 38 studies (63%), for the subdomain ‘Predictors’ in 49 studies (82%) and for the subdomain ‘Outcome’ in 45 studies (75%). A full description of PROBAST scores for each study is available in eTable 2 in the Supplement. The high risk of bias in the ‘Analysis’ subdomain was mostly due to issues with the handling of missing data (n = 50; 83%), limited description of the handling of continuous and categorical predictors (n = 48; 80%) and lack of evaluation or reporting of model performance measures (n = 44; 73%). Assessment of the applicability of a study to the review question showed overall ‘low’ concerns for applicability in 40 studies (67%).

Figure 1

Prediction model Risk Of Bias Assessment Tool (PROBAST) results on risk of bias and concern for applicability in identified models for predicting high-need high-cost healthcare use. (a) Risk of bias—assessment of whether shortcomings in study design, conduct or analysis could lead to systematically distorted estimates of a model’s predictive performance. (b) Concern for applicability—assessment of whether the population, predictors or outcomes of the primary study differ from those specified in the review question.

Applicability for Clinical Use

Of the 45 studies that performed validation, 17 (38%) did not report a C-statistic and eight (18%) lacked information on candidate variables, resulting in 20 studies (44%) that were assessed for clinical applicability in Figure 2. Two studies presented models predicting multiple outcomes, on both healthcare expenditure and healthcare utilization, and are therefore included in this analysis twice.33,34 Among models predicting an expenditure outcome, three showed good discriminative ability, minimal risk of overfitting and low overall risk of bias (M, O and T in Fig. 2).31,35,36 One model (N in Fig. 2) showed good discriminative ability and minimal risk of overfitting but had an unclear overall risk of bias.34 Among models with a predicted outcome on utilization, one showed good discriminative ability and minimal risk of overfitting but an unclear overall risk of bias (R in Fig. 2).34 Finally, among models with a prediction timespan beyond 12 months, two demonstrated good discriminative ability but showed a risk of overfitting and an unclear risk of bias (I and K in Fig. 2).37,38 External validation of one of these models (HRUPoRT: High Resource User Population Risk Tool) in a separate study demonstrated strong discriminative ability (C-statistic 0.83) and good calibration (K in Fig. 2).37,39

Figure 2

Scatter plot of model performance, indicated by models’ discriminative ability to distinguish those with from those without the outcome (expressed as C-statistic; X-axis), vs. models’ expected performance in new patients, indicated by risk of overfitting to the development data (expressed as natural log of EPV; Y-axis) and risk of bias (ROB). (a) X-axis—C-statistic. (b) Y-axis—natural log of the number of events per variable. (c) Horizontal blue line—natural log of EPV 20 (3.0). An EPV ≥ 20 implies minimal risk of overfitting.26,27 (d) Vertical blue line—C-statistic 0.7. A C-statistic ≥ 0.7 implies good discrimination.23,24,25 (e) Risk of bias as assessed through the Prediction model Risk Of Bias Assessment Tool (PROBAST).28 U = outcome based on utilization; C = outcome based on cost; EPV = events per variable; ROB = risk of bias; ED = emergency department.
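The following is a rough sketch in Python of how such an applicability plot can be constructed with matplotlib; the plotted model coordinates are hypothetical and do not correspond to the models in Figure 2.

```python
# Sketch: plotting validated models by discrimination (C-statistic, x-axis)
# against risk of overfitting (ln(EPV), y-axis), as in Figure 2.
# The plotted coordinates are hypothetical, not the reviewed models' values.
import numpy as np
import matplotlib.pyplot as plt

models = {"A": (0.68, 2.1), "B": (0.74, 3.4), "C": (0.84, 4.0)}

fig, ax = plt.subplots()
for label, (c_stat, log_epv) in models.items():
    ax.scatter(c_stat, log_epv)
    ax.annotate(label, (c_stat, log_epv), textcoords="offset points", xytext=(5, 5))

ax.axvline(0.7, color="blue")           # C-statistic >= 0.7: good discrimination
ax.axhline(np.log(20), color="blue")    # EPV >= 20: minimal risk of overfitting
ax.set_xlabel("C-statistic")
ax.set_ylabel("ln(events per variable)")
plt.show()
```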

DISCUSSION

We provide an overview of non-proprietary models predicting future HNHC healthcare use. In the identified studies, measures of models’ predictive performance in terms of both discrimination and calibration are not consistently provided, and external validation is not regularly performed. Most models estimate the risk for patients to become part of a top percentage of the cost distribution within the next 12 months; only five studies, presenting 12 models (8%), had a prediction timespan of more than a year.

Predictors were derived most often from claims data and, in 20 studies (33%), from electronic health records (EHR). One explanation may be the ease of data collection from such administrative sources. Another may be that models combining administrative diagnosis data with predictors based on prior cost have been shown to outperform models using only diagnosis or only cost data when predicting individual future high cost.40 However, use of administrative data has disadvantages: claims data are often incomplete, and precision and utility may be hampered by the potential for patient disenrollment.14

Classification according to Andersen’s model showed that none of the studies included predictors from the ‘Predisposing’ subcategory of beliefs. Andersen and others have argued that the absence of beliefs in clinical and health services research may be explained by poor conceptualization and measurement rather than by irrelevance of the concept.41,42 Predictors from Andersen’s second category (‘Enabling’) were included in only 23 studies (38%), even though prior research has emphasized the role of social determinants of health in patients becoming HNHC.43 Inclusion of such characteristics is therefore likely to improve selection of patients amenable to intervention. However, ‘Enabling’ characteristics such as income or the urban-rural character of a patient’s residence are usually not amenable to clinical intervention, whereas the level of illness is. A likely explanation for the underrepresentation of ‘Enabling’ predictors in the identified models is the relative unavailability of social determinants in administrative data. Although patient and household interviews are potentially more informative on social determinants, they are costly and time-consuming instruments of data collection with potential for recall bias. Furthermore, not all data sources are available for all study populations. The setting in which case finding is projected is therefore an important consideration in choosing a model, because it determines which predictors are accessible and, therefore, which model can be used.

The heterogeneity in predicted outcomes in this review reflects the heterogeneity in HNHC case definition. Some models emphasize high need (e.g. top decile of hospital admissions), while others focus on high cost (e.g. top decile of the cost distribution). Yet, modelling high need alone will not fully capture high-cost patients, and vice versa.44 This can be problematic when a model is used as the sole case-finding strategy for HNHC care management. For example, a high predicted risk of belonging to next year’s top decile of the cost distribution does not by itself reveal the underlying causes or whether care management is an appropriate intervention.44 A more hybrid approach to case finding may therefore be preferred, combining quantitative prediction modelling with an individual, more qualitative assessment, thereby facilitating the inclusion of ‘Enabling’ characteristics and beliefs into the selection for care management.45

The prediction timespan of a model is another important aspect of its applicability in case finding. Although we identified models that predict who will incur high healthcare costs during the subsequent year with a rather high degree of accuracy, patients who remain HNHC over multiple years are particularly interesting candidates for care-management programmes.38,46,47 Previous studies have shown that only 28–51% of HNHC patients remained HNHC after a year, whereas the others had died or returned to non-HNHC status.38,47 Predicting HNHC persistence rather than HNHC status in the next 12 months is therefore more clinically relevant with regard to case finding for HNHC care management. On the other hand, some studies specifically focussed on patients with a temporary HNHC status, such as Transient High Utilisers (THUs) in the study by Ng et al. or cost bloomers in the study by Tamang et al. Predicting temporary HNHC status may be particularly difficult but may also provide insight into the differences between these two populations.38,46

Model performance varied and was not always described in detail. Most studies (n = 45; 75%) conducted validation, yet this was external in only 20 studies (33%), limiting implications for clinical applicability. Nearly all studies described some measure of model performance, but often this was limited to discrimination alone (n = 24; 40%). Prior research has underlined the importance of rigorous validation and performance assessment when developing a prediction model.22 However, even in those studies where both discrimination and calibration were described, assessment of clinical usefulness was largely absent.

Comparing models to proprietary models based on summary statistics alone (e.g. R2 or the C-statistic) is precarious. A recent comparative analysis of claims-based tools by the Society of Actuaries, which included 12 models from seven developers, showed that model performance expressed as R2 varied from 20.5 to 32.1%.16 No C-statistics were provided. A comparison of proprietary and non-proprietary models in the same (external validation) sample is much more informative. Studies that directly compared their model(s) with a proprietary model showed relatively comparable discriminative abilities.31,32 Clear advantages of proprietary prediction models are that they are thoroughly validated, can be tailored to a specific setting and are usually well integrated in specific software.14 Disadvantages are the cost of implementation and the relative lack of transparency with regard to the predictive results.14

A high or unclear risk of bias (n = 53; 88%) was predominantly due to issues in the ‘Analysis’ subdomain of the PROBAST tool. An important reason for risk of bias was the inappropriate handling of continuous or categorical predictors (n = 48; 80%), for example, when a continuous predictor was categorized. Although categorization has practical advantages, it leads to an unnecessary loss of information and increases the potential for bias, as the categories can be chosen favourably.48 Another reason for risk of bias was the use of missing data as an exclusion criterion (n = 21; 35%) and limited description of how missing data were handled in general (n = 50; 83%). This may introduce selection bias and negatively affect the validity of the model during external validation.49
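The information loss caused by categorizing a continuous predictor can be illustrated with simulated data; the following minimal sketch in Python, assuming scikit-learn, compares the discrimination of a model using a continuous predictor with one using its dichotomized form.

```python
# Sketch: the information lost by categorizing a continuous predictor,
# shown on simulated data. A model using a dichotomized predictor typically
# discriminates worse than one using the predictor's continuous form.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)                       # hypothetical continuous predictor
y = rng.binomial(1, 1 / (1 + np.exp(-1.5 * x)))   # hypothetical HNHC outcome

x_continuous = x.reshape(-1, 1)
x_dichotomized = (x > 0).astype(int).reshape(-1, 1)  # e.g. split at the median

for name, features in [("continuous", x_continuous), ("dichotomized", x_dichotomized)]:
    model = LogisticRegression().fit(features, y)
    auc = roc_auc_score(y, model.predict_proba(features)[:, 1])
    print(f"{name}: C-statistic {auc:.2f}")
```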

The question of which models are most useful for case-finding strategies is best answered in terms of clinical usefulness (i.e. decision-curve analysis). However, such analyses were rarely performed in the included studies, leaving a gap in current evidence.22,50 Ideally, a model identifies those patients most amenable to intervention and for whom care management has the potential to improve quality of care while reducing or replacing high-cost care.13,51 Various trials demonstrated positive effects of care-management programmes on quality of care, quality of life and healthcare use (e.g. hospitalizations).7,10,11 However, evidence is mixed, and one of the main reported challenges is the selection of the appropriate population.7,10,11,45 Prediction modelling can help identify suitable patients at an early stage. Programmes with a focus on transitional care (i.e. the coordination and continuity of healthcare as patients transfer between different locations or levels of care) may be served by a model based on hospitalized patients (e.g. I in Fig. 2).52 On the other hand, an interdisciplinary (primary) care team may be better served by a model based on the general population (e.g. O or T in Fig. 2).35,38,53 Yet, all these models (I, O and T in Fig. 2) include predictors based on prior cost and therefore require data on prior healthcare cost, which are not always available. Validated models employing techniques other than regression modelling were not included in Figure 2 because an EPV could not be calculated. This does not imply they are less applicable in practice: for example, one study developed models by means of text mining with good discriminative ability and low overall risk of bias, demonstrating the potential of newer modelling techniques.54 Thus, the choice of a predictive model, as part of a case-finding strategy for HNHC care management, primarily depends on the setting (e.g. hospital, general population), data availability (e.g. data on prior cost) and the specific goals of the programme (e.g. reduction in total cost or in the number of primary care visits). After considering these aspects, the choice of model can be refined based on the risk of bias of the underlying study or the risk of overfitting.
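For reference, decision-curve analysis rests on the net-benefit calculation, in which, at a chosen threshold probability p_t, net benefit = TP/N − FP/N × p_t/(1 − p_t), and the model is compared against treating everyone or no one. The following is a minimal sketch in Python with simulated data; the outcome prevalence and predicted risks are hypothetical.

```python
# Sketch: the net-benefit calculation underlying decision-curve analysis.
# At threshold probability p_t: net benefit = TP/N - FP/N * p_t / (1 - p_t).
# Data are simulated; prevalence and risk estimates are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
y_true = rng.binomial(1, 0.1, size=5000)          # hypothetical HNHC outcomes
y_pred = np.clip(0.1 + 0.5 * (y_true - 0.1) + rng.normal(0, 0.1, 5000), 0, 1)

def net_benefit(y, p, threshold):
    selected = p >= threshold                      # patients flagged for care management
    tp = np.sum(selected & (y == 1)) / len(y)      # true positives per patient
    fp = np.sum(selected & (y == 0)) / len(y)      # false positives per patient
    return tp - fp * threshold / (1 - threshold)

for t in (0.05, 0.10, 0.20):
    nb_model = net_benefit(y_true, y_pred, t)
    nb_all = net_benefit(y_true, np.ones_like(y_pred), t)  # treat everyone
    print(f"p_t={t:.2f}: model {nb_model:.3f}, treat-all {nb_all:.3f}, treat-none 0.000")
```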

This study has several limitations. First, a meta-analysis could not be performed because study populations and outcomes varied considerably. Second, studies with populations defined by a single specific disease or condition (e.g. COPD or Alzheimer’s disease) were excluded because we focused on models with the potential to be implemented in care-management programmes aimed at a broad target population. However, disease-specific models may provide useful insight into the underlying mechanisms of becoming HNHC and may also improve predictive abilities within subpopulations.

This study has several strengths. To our knowledge, this is the first systematic review of non-proprietary prognostic models predicting HNHC healthcare use. Furthermore, we provide practical guidance on how to choose between available models as part of a case-finding strategy for HNHC care-management programmes.

One implication of this review is that, when designing a case-finding strategy for HNHC care management, the most important aspects to consider in choosing a prediction model are the setting, data availability and the specific goals (i.e. desired outcome) of the programme. Furthermore, future research may benefit from a shift from developing new models to validating and extending existing models.55 Another relevant observation is the relative paucity of models predicting HNHC persistence over multiple years, which has important implications for the identification of suitable candidates for related care-management programmes, often designed with long-term horizons. On the other hand, one may argue that distinct strategies can be used to prevent either transient or persistent HNHC status. Furthermore, incorporating models into clinical practice warrants further research, such as decision-curve analyses to improve assessment of clinical usefulness. Lastly, a hybrid approach of quantitative prediction modelling and individual qualitative assessment may improve case finding of appropriate patients for care management, by specifically including ‘Beliefs’ and ‘Enabling’ characteristics, such as socioeconomic and psychosocial circumstances.45

CONCLUSIONS

In summary, a variety of models predicting future HNHC healthcare use is available, most often estimating the risk for patients to become part of some top percentage of the cost distribution in the subsequent year. Future research on case-finding strategies for HNHC care-management programmes should focus on validating and extending existing models, developing models that predict HNHC persistence and assessing clinical usefulness in order to improve quality of care for this complex patient population.