Key Summary Points
Why carry out this study?
The sizeable under-reporting of body mass index (BMI) data in administrative healthcare claims databases impedes the comprehensive study of the population with obesity, and improved methodology is needed.
To address this need for improved methodology, we have harnessed machine-learning techniques to interpolate BMI variable data.
What was learned from the study?
Based on this study, machine-learning algorithms can be applied to administrative healthcare claims data to predict BMI classifications with high validity.
This novel approach can be leveraged across multiple therapeutic areas to better understand variations in BMI-related disease risk, treatment outcomes, healthcare resource use, and costs in real-world settings.
The strategic machine-learning approach undertaken in this study may also be relatively easily applied to the development of similar predictive models for other under-reported clinical variables in administrative healthcare claims databases.

Digital Features

This article is published with digital features, including a summary slide, to facilitate understanding of the article. To view digital features for this article go to https://doi.org/10.6084/m9.figshare.13359923.

Introduction

Real-world evidence (RWE) generated from administrative healthcare claims databases is valuable for understanding patient characteristics, health outcomes, and health economics at the population level [1, 2]. Such administrative healthcare claims database analyses are increasingly being utilized for clinical evidence generation, and they complement the evidence generated from randomized clinical trials and other clinical intervention studies [1, 2]. The findings of claims-based studies are informative to many healthcare system stakeholders, including providers and payers, federal and local government agencies, pharmaceutical/medical device companies, and patients [1, 2]. Although administrative healthcare claims database analyses provide several advantages (e.g., large heterogeneous populations, rare event capture, low cost, short time-frames for completion) [2], they also have limitations, including the incomplete reporting of certain clinical variables in the data sources. This creates obstacles to the comprehensive and accurate understanding of patient characteristics and outcomes.

One notable example of such a clinical variable is body mass index (BMI), a biometric measure that has been used in the risk assessment of many health conditions, with a BMI of 30 kg/m2 or greater indicating the medical condition of obesity in adults and greater health risk [3, 4]. National organizations in the US, such as the Centers for Disease Control and Prevention, have stratified obesity into 3 severity classifications (BMIs 30 to < 35 kg/m2, BMIs 35 to < 40 kg/m2, and BMIs ≥ 40 kg/m2), which reflect increasing health risks [3, 4]. BMI is predictive of greater risk for multiple disease conditions, including metabolic syndrome, type II diabetes, cardiovascular disease, some cancers, liver and kidney disease, arthritis, asthma, and depression, as well as a greater risk for all-cause mortality [4,5,6]. Additionally, variations in BMI are predictive of healthcare resource utilization and costs [7,8,9]. The health risks associated with obesity and its high prevalence in the US [4] necessitate the study of populations with obesity across several inter-related facets, such as population sociodemographic and clinical characteristics, current and emerging health outcomes and costs, value of therapeutic interventions, patient–drug/procedure interactions, etc. However, in an administrative healthcare claims database analysis that included 746,763 health plan members in years 2013–2016, BMI-related diagnoses were coded for only 14.6% of members [10]. The sizeable under-reporting of BMI data in administrative healthcare claims databases impedes the comprehensive study of the population with obesity, and improved methodology is needed.

To address this need for improved methodology, we have harnessed machine-learning (ML) techniques to interpolate BMI variable data. ML is a rapidly advancing field and refers to algorithms and statistical methodologies that are used to build analytical models based on systems learning from data, identifying patterns, and yielding decisions [11]. Such statistical tools, including gradient-boosted decision trees, least absolute shrinkage and selection operator (LASSO) regression, random forest, and artificial neural networks (NN), can be applied to raw data sets for the imputation of missing data, replacement of outliers, feature extraction, statistical classification, and optimization of predictive model accuracy [12]. Among other applications, ML techniques have been shown in multiple RWE studies to be useful for model development for the prediction of diagnoses, clinical variables, and disease risk [12,13,14,15,16,17,18,19,20]. The objective of this study was to construct models by implementing ML algorithms to predict BMI classifications (≥ 30, ≥ 35, and ≥ 40 kg/m2) in administrative healthcare claims databases, and then internally and externally validate them, and thereby expand the utility for RWE generation of administrative healthcare claims database analyses.

Methods

Data Sources

Three real-world US administrative healthcare databases were utilized in this study, the Optum PanTher Electronic Health Record database (Optum EHR), the Optum Clinformatics Date of Death (Optum DOD) database, and the IBM MarketScan Commercial Claims and Encounters (IBM CCAE) database. Both the Optum EHR and DOD databases were used for model development and validation purposes. The IBM CCAE database was used as the external validation database in this study. All datasets were from databases of de-identified patient data, and so ethics committee approval was not required.

The Optum EHR multi-dimensional database contains de-identified information on outpatient visits, diagnostic procedures, medications, laboratory results, hospitalizations, clinical notes, and patient outcomes primarily from Integrated Delivery Networks. The EHR data encompass > 80 million patients with ≥ 7 million from each US census region. The database contains a provider network of over 140,000 providers at > 700 hospitals and 7000 clinics with broad geographical representation.

The Optum DOD longitudinal administrative claims database is comprised of claims data from United Healthcare (UHC) fully insured patients, UHC administrative services only, Medicaid, and legacy Medicare Choice membership. The data include integrated enrollment, inpatient, outpatient, and outpatient pharmacy claims for > 80 million unique de-identified members since 2000.

The IBM CCAE database is a longitudinal administrative claims database comprised of de-identified data from individuals enrolled in employer-sponsored insurance health plans. The data include inpatient, outpatient, and outpatient pharmacy claims, as well as enrollment data, from large employers and health plans which provide private healthcare coverage to > 140 million employees, their spouses, and dependents.

Study Methodology Flow

The Optum EHR and the Optum DOD databases were used to supply training datasets for the five advanced ML algorithms that were implemented to construct the predictive models of each BMI classification. The constructed predictive models were then internally validated on the Optum databases and externally validated on the IBM CCAE database. The methodology flow of this study is depicted in Fig. 1 and involved 6 steps: (1) data extraction, (2) feature aggregation, (3) exploratory data analysis, (4) feature engineering, (5) modeling and sensitivity analysis, and (6) model selection. The primary goal was to implement predictive models to interpolate BMI classifications within claims data representative of large populations.

Fig. 1 Methodology flow

Data Extraction

All datasets for the study populations were extracted from the Optum EHR, Optum DOD, and IBM CCAE databases for the period from January 1, 2013 to December 31, 2019, based on the latest data available at the time of assessment. A BMI reading was identified either from a BMI observation (numeric value) in the Optum EHR dataset or from an International Classification of Diseases (ICD)-9/10 diagnosis code indicating a BMI classification in the claims data sources. Each BMI reading (observation/diagnosis) during the study intake period from January 1, 2014 to December 31, 2019 was indexed on the event date as a reading, so that one person may have contributed multiple readings. Sociodemographic information, including age, gender, US region, and US regional division, and clinical characteristics, including all recorded medical diagnoses, medications, and procedures, were extracted at the index date and during the corresponding 12-month baseline periods, separately for each index BMI reading. Additionally, BMI readings for each quarter prior to the index reading were extracted. Data extraction was performed on disease-agnostic populations (i.e., not a subset population with a specific disease). The codes and descriptions of all sociodemographic information and clinical characteristics extracted for the study populations are provided in the online supplement.
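The indexing scheme described above, in which each BMI reading becomes its own record with features drawn from the preceding 12 months, can be illustrated with a minimal sketch. The tuple layout, the code labels (`CCS_49`, `GPI_27`, `CCS_98`), and the use of 365 days to represent the 12-month baseline are assumptions made for illustration only and are not taken from the study.

```python
from datetime import date, timedelta

def baseline_features(index_date, events, baseline_days=365):
    """Return the codes recorded on the index date or in the baseline window before it.

    `events` is a list of (event_date, code) tuples; 365 days stands in for
    the 12-month baseline period (an assumption of this sketch).
    """
    start = index_date - timedelta(days=baseline_days)
    return {code for when, code in events if start <= when <= index_date}

# Hypothetical patient history; the code labels are placeholders.
events = [
    (date(2014, 3, 1), "CCS_49"),
    (date(2015, 6, 15), "GPI_27"),
    (date(2016, 1, 10), "CCS_98"),
]

# One person may contribute several index readings, each with its own baseline.
bmi_readings = [(date(2015, 1, 20), 31.2), (date(2016, 2, 1), 35.4)]

records = [
    {"index_date": d, "bmi": v, "features": baseline_features(d, events)}
    for d, v in bmi_readings
]
```

In this toy example the first reading picks up only the 2014 event, while the second picks up the two later events, which shows why features must be recomputed separately for each index reading.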

Feature Aggregation

The diagnosis codes (ICD-9/-10) and procedure codes [ICD-9/-10; Current Procedure Terminology (CPT-4) codes; Healthcare Common Procedure Coding System (HCPCS) codes] were grouped using Clinical Classification Software (CCS), while medication codes were grouped using the Generic Product Identifier (GPI). Such groupings were used to increase the ease of computation and clinical interpretation.

Exploratory Data Analyses

To understand the distribution of data, extensive exploratory data analyses were performed to identify any data anomalies and reduce data dimensions. Table 1 shows the results of the different models across the three databases.

Table 1 Results of the different models across the 3 databases

Feature Engineering

Given that both the dependent variables (BMI classifications) and all potential features, except age, were dichotomous, random forest methods and Chi-square tests were used to identify the features that were significantly associated with each BMI classification. Firstly, the random forest algorithm was used to rank the features by feature importance score. Then, the top ranked features were cross-validated using the Chi-square test. Feature selection was performed separately in the Optum EHR and DOD databases. Due to the constraint of computation power and the large available sample size, only 2% of random samples from the Optum EHR database and 20% from the Optum DOD database were used at a time in the feature selection analyses. To reduce selection bias, the random samples were bootstrapped 5 times, performing the same analyses in each iteration. Out of the entire 1,266 available features from the Optum EHR and DOD databases, 379 features that were consistently identified across the two databases and 5 iterations were finally selected for the predictive models (Fig. 2). Of the 379 features, again a feature selection process was carried out and the top 100 features were selected (Table 2). Since the models performed better when they were trained on the set of 100 features, all 5 of the ML algorithms were trained using the top 100 selected features.
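The consistency requirement described above, in which a feature is kept only if it is significant in every resampled iteration, can be sketched as follows. This sketch covers only the Chi-square step on binary features and the intersection across iterations; the random forest ranking is omitted. The function names, the toy data, the 0.05 significance level, and sampling with replacement are all assumptions for illustration, not details confirmed by the study.

```python
import random

CRITICAL_VALUE = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05

def chi_square_2x2(feature, label):
    """Pearson chi-square statistic for two binary sequences (no continuity correction)."""
    n = len(feature)
    a = sum(1 for f, y in zip(feature, label) if f == 1 and y == 1)
    b = sum(1 for f, y in zip(feature, label) if f == 1 and y == 0)
    c = sum(1 for f, y in zip(feature, label) if f == 0 and y == 1)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # a constant feature or label carries no association
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def select_consistent(rows, labels, names, n_iter=5, frac=0.2, seed=0):
    """Keep features whose chi-square statistic is significant in every resample."""
    rng = random.Random(seed)
    kept = None
    for _ in range(n_iter):
        # Draw a random sample with replacement (assumed resampling scheme).
        idx = [rng.randrange(len(rows)) for _ in range(int(frac * len(rows)))]
        y = [labels[i] for i in idx]
        significant = {
            name
            for j, name in enumerate(names)
            if chi_square_2x2([rows[i][j] for i in idx], y) > CRITICAL_VALUE
        }
        kept = significant if kept is None else kept & significant
    return kept

# Hypothetical toy data: feature_a tracks the label, feature_b is constant.
labels = [i % 2 for i in range(200)]
rows = [[y, 0] for y in labels]
names = ["feature_a", "feature_b"]
selected = select_consistent(rows, labels, names)
```

Taking the intersection across iterations is a conservative choice: a feature that is significant only in some resamples is dropped, which mirrors the intent of reducing selection bias.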

Fig. 2 Process flow of machine-learning algorithm implementation for feature engineering

Table 2 Number of features selected for each BMI classification prediction

Modeling and Sensitivity Analysis

Binary classification models were developed for the following BMI classifications: BMI = 30 kg/m2 (model output = 1, if BMI ≥ 30; = 0 if BMI < 30); BMI = 35 kg/m2 (model output = 1, if BMI ≥ 35; = 0 if BMI < 35); BMI = 40 kg/m2 (model output = 1, if BMI ≥ 40; = 0 if BMI < 40). Considering that some patients may have historical BMI data available in the baseline, which could be a strong predictor, while others do not, 2 models were developed for each of the BMI classifications to account for these 2 scenarios. The first model (model 1) included the baseline BMI feature in addition to the other 100 selected features, and was only trained among patient cohorts with baseline BMI data available. The second model (model 2) was built on the 100 selected features, and was trained on patient cohorts without baseline BMI data. Four mathematically different algorithms were implemented for the models: CatBoost, random forest, LASSO, and NN. CatBoost and random forest provide nonlinearity due to their tree-based approach, while LASSO is a linear model. NN provide a computation-intensive approach based on various activation functions. With these 4 mathematically different algorithms, we ensured use of varied ML techniques to address our research objective. Additionally, both models 1 and 2 were trained with a novel automated (self-assigned/calculated) weighted prediction approach (Super Learner algorithm; SLA), which leveraged the predictions from the four different ML algorithms through a logistic regression with 5 bootstrapped random samples from the Optum EHR and DOD databases.
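The weighted prediction idea behind the SLA can be illustrated with a minimal sketch in which a logistic regression is fitted on the predicted probabilities of four base models. This is not the study's implementation: the base-model probabilities below are invented toy values, the gradient-descent fitting routine and its settings are assumptions, and the bootstrapping step is omitted.

```python
import math

def bmi_label(bmi, threshold):
    """Binary target as defined above: 1 if BMI >= threshold, else 0."""
    return 1 if bmi >= threshold else 0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_meta_learner(base_probs, y, lr=0.5, epochs=2000):
    """Fit logistic-regression weights on base-model probabilities by gradient descent."""
    k, n = len(base_probs[0]), len(y)
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for p, target in zip(base_probs, y):
            err = sigmoid(b + sum(wi * pi for wi, pi in zip(w, p))) - target
            for j in range(k):
                grad_w[j] += err * p[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, p):
    """Ensemble probability for one row of base-model probabilities."""
    return sigmoid(b + sum(wi * pi for wi, pi in zip(w, p)))

# Toy BMI values and the BMI >= 30 target.
y = [bmi_label(v, 30) for v in [24.0, 27.5, 31.0, 36.2, 29.9, 41.0]]

# Hypothetical predicted probabilities from four base models (one column each).
base_probs = [
    [0.10, 0.20, 0.15, 0.25],
    [0.30, 0.25, 0.20, 0.35],
    [0.70, 0.65, 0.80, 0.60],
    [0.90, 0.85, 0.80, 0.95],
    [0.40, 0.30, 0.35, 0.45],
    [0.95, 0.90, 0.85, 0.80],
]

weights, intercept = fit_meta_learner(base_probs, y)
ensemble = [predict(weights, intercept, p) for p in base_probs]
```

In practice the meta-learner would typically be fitted on out-of-fold base-model predictions rather than on the same rows used to train the base models, to avoid rewarding base models that overfit.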

In addition to using varied ML techniques, several sensitivity analyses were performed to pursue optimal model performance. First, as previously mentioned, the models were examined using the full 379 features versus only using the top 100 features; the latter yielded a better performance due to less overfitting. Second, model performance was compared when measuring clinical characteristic features on a quarterly basis versus on a yearly basis during the 12-month baseline period. The results indicated better performance using the yearly measured features. In addition, due to the rarity of BMI ≥ 35 and ≥ 40 kg/m2 classifications in the populations, an oversampling technique was applied to improve the model sensitivity; 3 oversampling ratios, 50/50, 60/40, and 70/30, were evaluated while creating the training datasets. Furthermore, hyperparameter tuning was performed for all the algorithms to maximize the performance of the models. Lastly, the models were trained separately on male and female cohorts; however, no significant improvement was observed compared to the models trained on the gender-combined cohort.
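The oversampling step can be sketched as random duplication of positive-class rows until a target class share is reached. The text does not state which side of each ratio refers to the rarer class, so the `positive_share` parameter below (e.g., 0.4 for a 60/40 split) reflects an assumed reading; the function name and toy data are likewise hypothetical.

```python
import random

def oversample(rows, labels, positive_share=0.5, seed=0):
    """Duplicate randomly chosen positive rows until they form `positive_share` of the data."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if not pos:
        raise ValueError("no positive rows to oversample")
    rng = random.Random(seed)
    # Number of positives needed so that positives / (positives + negatives) hits the target.
    target_pos = round(positive_share * len(neg) / (1 - positive_share))
    extra = [rng.choice(pos) for _ in range(max(0, target_pos - len(pos)))]
    idx = pos + extra + neg
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]

# Toy training set: 10 positives (e.g., BMI >= 40) among 100 rows.
toy_labels = [1] * 10 + [0] * 90
toy_rows = [[i] for i in range(100)]

rows_50, labels_50 = oversample(toy_rows, toy_labels, positive_share=0.5)
rows_40, labels_40 = oversample(toy_rows, toy_labels, positive_share=0.4)
```

Oversampling of this kind would normally be applied to the training data only, so that validation metrics are computed on the natural class distribution.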

Predictive Model Performance

The performance of predictive models 1 and 2 was internally examined in the Optum databases and externally tested in the IBM CCAE database (Fig. 3). The performance was assessed by the area under the receiver operating characteristic curve (ROC AUC), F1 score, accuracy, negative predictive value (NPV), specificity, precision, and recall.
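For reference, the listed metrics can all be derived from the confusion matrix and the predicted scores. The following sketch uses the standard definitions; the toy labels, scores, and the 0.5 decision threshold are illustrative assumptions and are not study data.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, specificity, NPV, and F1 from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "npv": safe_div(tn, tn + fn),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return safe_div(wins, len(pos) * len(neg))

# Toy example with a 0.5 decision threshold.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

results = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

ROC AUC is computed from the scores and does not depend on the decision threshold, whereas the other metrics change with it; this is one reason the two kinds of measure are reported side by side.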

Fig. 3 Training and testing datasets

Results

The best algorithms of the models and the oversampling ratios across all the iterations of BMI classifications are shown in Table 3. The SLA on the top 100 features was the best ML algorithm for both models with a 50/50 oversampling ratio for the BMI ≥ 30 kg/m2 classification and a 60/40 oversampling ratio for the BMI ≥ 35 and ≥ 40 kg/m2 classifications.

Table 3 Best algorithm of the models and oversampling ratios across all the iterations of BMI classifications

Internal Validation

Implementing the SLA on model 1, and internally validating on the Optum DOD database, yielded ROC AUC values of approximately 88% for the prediction of BMI classifications of ≥ 30, ≥ 35, and ≥ 40 kg/m2, while accuracy ranged from 87.9% to 92.8%, F1 score ranged from 77.3% to 87.7%, and specificity ranged from 91.8% to 94.7% (Fig. 4). Implementing the SLA on model 2, and internally validating on the Optum DOD database, yielded ROC AUC values of approximately 73% for the prediction of BMI classifications of ≥ 30, ≥ 35, and ≥ 40 kg/m2, while accuracy ranged from 73.6% to 80.0%, F1 score ranged from 48.1% to 74.6%, and specificity ranged from 71.6% to 85.9% (Fig. 5). Detailed predictive performance results of models 1 and 2 trained on the Optum DOD database and internally validated on the Optum DOD database are shown in Supplementary Table 1.

Fig. 4 Predictive performance results of model 1 trained on the Super Learner algorithm and internally validated on the Optum DOD database; ROC AUC area under the receiver operating characteristic curve, NPV negative predictive value

Fig. 5 Predictive performance results of model 2 trained on the Super Learner algorithm and internally validated on the Optum DOD database; ROC AUC area under the receiver operating characteristic curve, NPV negative predictive value

External Validation

The external validation on the IBM CCAE database yielded relatively consistent results with slightly diminished performance, as expected. Implementing the SLA on model 1, and externally validating on the IBM CCAE database, yielded ROC AUC values ranging from 78.7% to 83.6% for the prediction of BMI classifications of ≥ 30, ≥ 35, and ≥ 40 kg/m2, while accuracy ranged from 84.0% to 90.0%, F1 score ranged from 66.9% to 81.8%, and specificity ranged from 90.5% to 95.5% (Supplementary Table 2). Implementing the SLA on model 2, and externally validating on the IBM CCAE database, yielded ROC AUC values ranging from 67.1% to 71.4% for the prediction of BMI classifications of ≥ 30, ≥ 35, and ≥ 40 kg/m2, while accuracy ranged from 69.5% to 74.4%, F1 score ranged from 40.6% to 69.7%, and specificity ranged from 70.6% to 83.7% (Supplementary Table 2). Detailed predictive performance results of models 1 and 2 trained on the Optum DOD database and Optum EHR databases and internally and externally validated are shown in Supplementary Tables 2–5.

Discussion

In this study, we implemented multiple ML algorithms to construct and optimize predictive models, and then applied the models to administrative healthcare claims databases to predict BMI ranges and thereby expand the coverage of BMI data in such data sources. The 2 SLA-based models exhibited the best predictive capabilities of BMI classifications of ≥ 30, ≥ 35, and ≥ 40 kg/m2. Model 1 [ROC AUC values of approximately 88% across the 3 predicted BMI classifications (internal validation); 79–84% (external validation)], which included baseline BMI data, performed better than model 2, which did not include baseline BMI data. However, in the absence of baseline BMI data, model 2 yielded a satisfactory performance, with ROC AUC values of approximately 73% across the 3 predicted BMI classifications (internal validation); 67–71% (external validation). Applying these predictive models to administrative healthcare claims data sources in real-world database studies will potentially produce a better understanding for researchers, healthcare providers, payers, patients, and other healthcare system stakeholders of variations in sociodemographic data, health outcomes, healthcare costs, responses to therapeutic interventions, patient–drug/procedure interactions, etc. among persons with different BMI classifications.

Prior research studies using administrative healthcare data sources have repeatedly shown that BMI is substantially under-reported [10, 14, 21, 22]. Martin et al. conducted a study (2002–2008 patient cohorts) in which an administrative database was referenced to a clinical registry database, and reported low sensitivity (7.75%) but high specificity (98.98%) for detecting obesity based on diagnosis (ICD-10 diagnosis codes E65–E68) in the administrative data source [22]. The obesity prevalence in the administrative database was only 2.4% compared to the 20.3% prevalence observed among a patient cohort in the clinical registry database [22]. In a more recent study (2013–2016) of administrative EHR and claims data (Optum Integrated Claims database), Ammann et al. reported that, among 746,763 plan members, 14.6% had BMI-related diagnoses coded [10]. In that study, the ICD-9/-10 codes had a satisfactory predictive value (> 70%) across different BMI classifications, meaning that the claims-based diagnoses were fairly accurate [10]. However, their sensitivity was relatively low at < 30% [10]. This low sensitivity may in part be attributed to a BMI distribution in the claims data that is skewed towards the obese and morbidly obese population, because individuals with a BMI in the normal range or who are mildly overweight often do not have a recorded ICD-9/-10 diagnosis and are thus under-represented. Other reasons for limitations of BMI data in administrative healthcare claims databases include that obesity remains under-recognized as an actual disease and that there is a coding focus during data extraction on more obvious clinical disease categories; in physician notes as well, the term obesity is frequently not mentioned or more loosely termed [22]. In light of the findings of Martin et al. and Ammann et al., the predictive models of BMI classifications constructed in our study with ML using 100 key patient features significantly improve the prediction of obesity of patient cohorts represented in administrative healthcare claims databases, especially the sensitivity (i.e., the low sensitivity and skewed BMI distribution affect only the imbalance of the BMI classifications in the training data, which was partially addressed here through oversampling techniques). The high specificity of the claims-based BMI data ensures a high degree of internal validity during the model development and validation.

Based on cross-sectional National Health and Nutrition Examination Survey data, among adults (> 20 years of age) in the US in 2015–2016, the prevalence of obesity was 39.6% and among youths (2–19 years of age) it was 18.5% [23]. Although the increase in obesity may appear to be stabilizing, at least compared to 2013–2014 survey data in the US, the obese population represents a significant proportion of the overall US population. Large administrative healthcare claims data sources with ML algorithms implemented can supplement such nationwide survey data to provide more complete datasets of subpopulations, in this instance also stratified by BMI classes. Moreover, the large amount of sociodemographic and clinical characteristic data contained in such sources can be helpful to understand the prevalence of obesity, as well as the extent of underdiagnosis, across many different subpopulations (i.e., US geographic regions, age groups, health insurance types, disease categories, etc.). Together with the multitude of comorbidities (diabetes, cardiovascular disease, cancer, etc.) associated with obesity and increased healthcare costs corresponding with higher BMI classifications [4,5,6,7,8,9], it is important to use big data sources to understand at the population-level variations in health outcomes of those with obesity; the stratification by BMI classifications may also provide a more in-depth understanding of the impact of obesity severity on health outcomes. The utility of this RWE generation, particularly when combined with other big data technologies (e.g., genomics, metabolomics, information collected by personal monitoring devices, GPS, Fitbit), can be explored under the infrastructures of health system disease management and public health surveillance and interventions [24, 25].

The results of this study and application of the constructed predictive models of BMI classifications in secondary database analyses have certain limitations. Firstly, the clinical utility of BMI in the assessment of risks for obesity-related comorbidities is well recognized at the population level; however, it can sometimes be less useful in assessing the risks of obesity-related comorbidities among individual patients due to the heterogeneity in fat distribution across people in general, males and females, age groups, race/ethnic groups, etc. [26]. Despite having some shortcomings as a clinical biomarker, BMI is a widely accepted useful tool in clinical practice, especially when applied with other clinical measures of cardiometabolic risk factors [26]. Secondly, the databases we utilized are comprised of administrative healthcare data mostly from a single channel of insured members (all age groups) across the US, and thus, the predictive models may not be generalizable to healthcare systems outside the US, and the performance may vary among subpopulations in specific states or regions, or specific age groups (e.g., pediatric population). However, the Optum DOD database does contain a portion of patients with Medicaid and legacy Medicare Choice membership. Also, the administrative healthcare data sources we used to build and train the predictive models are subject to potential coding errors, inconsistencies, and incompleteness. Additionally, the presence of a diagnosis code on a medical claim does not guarantee positive presence of a disease, as the diagnosis code may be incorrectly coded or included as a rule-out criterion.

While other methodologies, such as simpler regression analyses, may be useful for the prediction of BMI classifications in certain instances, our primary goal was to implement and optimize predictive models to interpolate BMI classifications within claims data representative of large populations. The constructed predictive models provided a robust solution to achieve our research objective based on several strengths. First, we used both EHR and claims data sources, giving both provider and payer perspectives, to select features that were significantly associated with BMI classifications more comprehensively, which helped to reduce the potential intrinsic information bias of using one data source type caused by the data collection mechanism. Second, we constructed 2 models for the scenarios with and without BMI history to best leverage BMI history to improve model performance. In addition, we implemented 4 mathematically different ML algorithms, and we discovered during this process that some performed better than others. Because of this variability in performance, which by itself provides an assessment of the different ML algorithms, we then combined them into the SLA. The more sophisticated SLA demonstrated superiority over the other singularly used ML algorithms; the SLA has also been shown in other analyses to be the optimal tool for constructing such predictive models [27, 28]. Such an approach may additionally be more efficient than simpler statistical approaches for assessing multiple covariate interactions and correlation terms when the number of covariates is large. Furthermore, the various sensitivity analyses we conducted strengthened the model selection decisions. Lastly, we externally validated the models in another large nationally representative claims database, which demonstrated the predictive models’ performance stability and external validity. When a study budget and technical infrastructure are not constraining, this approach may be utilized over other simpler techniques to provide the optimal prediction model.

Conclusions

This study demonstrated the feasibility and validity of using ML algorithms to predict BMI classifications in administrative healthcare claims data to expand the utility for RWE generation. Furthermore, it was a relatively straightforward approach to access BMI information in claims-based data sources. This novel approach to predict BMI classifications in administrative healthcare claims data can be leveraged across multiple therapeutic areas to better understand variations in BMI-related disease risk, treatment outcomes, healthcare resource use, and costs in real-world settings, as well as be leveraged for other clinical variables that may be under-reported in administrative healthcare claims data sources.