INTRODUCTION

Promoting self-rated health is important in primary care and population health settings.1,2,3 Examining individuals’ self-assessment of their health shifts our focus as clinicians from treating illnesses to promoting wellness.4 Identifying contributors to patients’ self-assessments of their health may promote health equity in care by helping clinicians prioritize factors that can be addressed to meet patients’ concerns. Self-rated health is a commonly used patient self-assessment measure in clinical and public health data, and is also associated with several objective indicators of health status, including health care costs, the presence and severity of chronic disease, and the risk of mortality.3 Because key contributors to an individual’s sense of self-rated health are varied, include social as well as medical factors, and may differ across diverse groups, achieving equity in self-rated health may require active efforts to identify factors that are important within specific subgroups.3,5 However, clinical datasets rarely capture information on social factors that may influence self-rated health, and strategies to integrate social and clinical data to better understand concerns in specific groups are lacking.4

Because many features contribute to self-rated health, the process of identifying key associated factors may be suited to a big data approach that considers multiple interrelated dimensions that are predictive in diverse groups. Machine learning allows simultaneous consideration of multiple exposures without prespecified hypotheses.6 Methods for applying machine learning algorithms to address health equity as a part of prediction have not been fully described.

We applied common machine learning algorithms to the Behavioral Risk Factor Surveillance System (BRFSS) to gain insights from a large population-based data resource designed to measure a range of social, demographic, healthcare utilization, and behavioral factors that might contribute to self-rated health. We examined whether self-rated health could be predicted with good model fit in diverse groups using features from the BRFSS. Formally, our study objectives were twofold: (1) to build predictive models of self-rated health by applying machine learning algorithms to the 2016 and 2017 BRFSS, and (2) to examine the accuracy of, and insights gained from, machine learning models predicting excellent or very good self-rated health in diverse groups defined by sex, race/ethnicity, and age across the life course in the US.

METHODS

Data Sources and Study Population: the Behavioral Risk Factor Surveillance System

We chose the BRFSS as a population-based data source because of its large sample size and the sociodemographic data available for examining self-rated health in multiple subgroups. The BRFSS is a cross-sectional, random-digit-dial telephone survey of the non-institutionalized civilian population aged 18 years and older in the United States (US). The survey is administered by the Centers for Disease Control and Prevention (CDC) and fielded annually by state health departments in the 50 states, the District of Columbia, and select US territories.5 Raking weights are used to produce population estimates that adjust for survey non-coverage, non-response, and the probability of being sampled given the geographic location, age, race, and sex of the participant.7

Analytic Sample

We included all participants in the 2017 BRFSS, the most recent data available at the time of the analysis (N = 449,492), and repeated the analysis in the 2016 BRFSS sample (N = 484,964) to validate our approach. We anticipated that a legacy of systemic racism in the US, along with known differences in self-rated health by age and sex, could contribute to differences in model fit and potentially yield different predictors of self-rated health by race, ethnicity, sex, and age.4 Thus, within each survey year, we analyzed data for the entire cohort and then stratified analyses along three dimensions: (1) sex, (2) race/ethnicity, or (3) age category (18–29, 30–39, 40–49, 50–59, 60–69, and 70 years and older).

Study Outcomes

Target Feature

The primary outcome of interest was self-rated health, measured by the question, “Would you say that in general your health is: excellent, very good, good, fair, or poor?” We classified individuals as reporting excellent or very good health versus good, fair, or poor health.

Feature Selection, Inclusion, and Exclusion Criteria

A detailed list of 2017 BRFSS features included in and excluded from the analysis is presented in Appendix Figure 1. We excluded features related to the landline/cell phone survey sampling components; features that were age- or sex-specific (e.g., prostate-specific antigen screening, mammography); features derivative of a feature already included in the analysis; and features closely related to self-rated health (e.g., health-related quality of life, number of poor mental health days). Lastly, we excluded features that might be unreliable due to missing values (item non-response greater than 50% of the population).
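As a small illustration of this missingness screen, the following is a minimal sketch of the greater-than-50% item non-response rule; the data frame name brfss is a placeholder, not the authors’ code, and it assumes “don’t know/refused” responses have already been recoded to missing.

```r
# Minimal sketch (assumed names): drop any candidate feature with item
# non-response above 50% of respondents.
# Assumes "don't know"/"refused" codes were already recoded to NA.
item_nonresponse <- colMeans(is.na(brfss))
brfss <- brfss[, item_nonresponse <= 0.5]
```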

To focus our analysis, we used a conceptual model, the Healthy People 2020 framework, to categorize the remaining 51 features for model inclusion into seven domains: demographics, which included age, sex, race, geographic division, state of residence, number of adults in the respondent’s household, marital status, veteran status, number of children, and language spoken; clinical conditions, which included a self-reported history of cancer, asthma, depression, diabetes, stroke, cardiovascular disease, kidney disease, arthritis, COPD, skin cancer, body mass index, angina, or hypertension; functional status, which included difficulty doing errands, difficulty dressing, difficulty walking, difficulty communicating, blindness, or deafness; access to clinical care, which included delayed care due to cost, having a primary care physician, insurance status, and having had a doctor visit in the previous year; health behavior, which included alcohol use, smoking status, e-cigarette use, use of chewing tobacco, exercise practices, drunk driving, seat belt use, Internet use in the last 30 days, daily fruit consumption, and daily vegetable consumption; preventive care, which included having had an HIV test, having identified HIV risk factors, and having had a flu vaccine; and socioeconomic status, which included educational attainment, income category, homeownership, employment, and cell phone use.8 To account for differences in scale, all features were formatted as binary dummy variables for analysis.
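As an illustration of this formatting step, the following is a minimal sketch, under assumed variable and data frame names, of dichotomizing the outcome described above and expanding the categorical features into binary dummy variables with the caret package; this is not the authors’ code.

```r
library(caret)

# Dichotomize self-rated health: excellent/very good vs. good/fair/poor.
# Assumes the BRFSS general health item (GENHLTH) has already been labeled;
# 'brfss' is a placeholder data frame.
brfss$srh_excellent_vg <- factor(
  ifelse(brfss$GENHLTH %in% c("Excellent", "Very good"), "Yes", "No"),
  levels = c("No", "Yes")
)
# Drop the original five-level item so it does not re-enter as a predictor.
brfss$GENHLTH <- NULL

# Expand every categorical predictor into binary indicator (dummy) columns
# so all features enter the models on a common scale.
dv <- dummyVars(srh_excellent_vg ~ ., data = brfss)
X  <- data.frame(predict(dv, newdata = brfss))
X$srh_excellent_vg <- brfss$srh_excellent_vg
```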

Data Analysis

Descriptive Data

We presented percentages for the top features of importance by age group (Table 1). Feature importance was determined by the machine learning classification described below.

Table 1 2017 BRFSS Descriptive Data by Age Groups

Machine Learning in R

We compared predictions and model fit for three supervised machine learning algorithms applied to the 2017 BRFSS data to identify features predictive of “excellent” or “very good” self-rated health compared with good, fair, or poor health. We compared regularized logistic regression, random forest, and support vector machine algorithms implemented in the caret package of R, version 3.4.0, using a high-performance computing cluster.9 We split the data into two-thirds training data and one-third testing data. We used bootstrap resampling during model training, which selects a random sample of the population “with replacement,” so that the full population is resampled with each model iteration. To examine model accuracy for diverse groups, we estimated model fit and performance using accuracy, the area under the curve (AUC), and receiver operating characteristic (ROC) curves for the entire population and within subgroups. The AUC summarizes the model prediction visualized by the ROC curve: a perfect model would have an AUC of 1.0 (perfect prediction of self-rated health), while a random model would have an AUC of 0.5 (chance prediction of self-rated health). Each algorithm used ROC-based variable importance to produce a ranked list of the top twenty features that contributed to prediction strength in the model. We examined the number of features identified within each of the seven domains and counted the number of times each feature appeared among the top twenty features contributing to model fit.
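To make this workflow concrete, the following is a minimal sketch, not the authors’ code, of how such a comparison could be set up in caret. The data frame brfss_ml, the outcome srh_excellent_vg (coded No/Yes), the radial SVM kernel, and the number of bootstrap resamples are illustrative assumptions.

```r
library(caret)
library(pROC)

set.seed(2017)
# Two-thirds training / one-third testing split, stratified on the outcome
idx      <- createDataPartition(brfss_ml$srh_excellent_vg, p = 2/3, list = FALSE)
train_df <- brfss_ml[idx, ]
test_df  <- brfss_ml[-idx, ]

# Bootstrap resampling during training; tune on the ROC summary metric
ctrl <- trainControl(method = "boot", number = 25,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

fits <- list(
  rlr = train(srh_excellent_vg ~ ., data = train_df, method = "glmnet",
              metric = "ROC", trControl = ctrl),   # regularized logistic regression
  rf  = train(srh_excellent_vg ~ ., data = train_df, method = "rf",
              metric = "ROC", trControl = ctrl),   # random forest
  svm = train(srh_excellent_vg ~ ., data = train_df, method = "svmRadial",
              metric = "ROC", trControl = ctrl)    # support vector machine (assumed radial kernel)
)

# Held-out AUC and the top twenty importance features for each algorithm
for (nm in names(fits)) {
  p <- predict(fits[[nm]], newdata = test_df, type = "prob")[, "Yes"]
  cat(nm, "test AUC:", round(as.numeric(auc(roc(test_df$srh_excellent_vg, p))), 2), "\n")
  print(varImp(fits[[nm]]), top = 20)
}
```

Note that in caret, varImp() returns a model-specific importance measure where one exists (e.g., for glmnet and random forest) and falls back to a ROC-based filter score otherwise, which parallels the ROC-derived importance ranking described above.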

Multiple Imputation for Machine Learning Analysis in R

To prevent bias from list-wise deletion in the analyzed data, we imputed missing values using multiple imputation from the mice (Multivariate Imputation by Chained Equations) package in R.10 The MICE algorithm imputed missing values using predictive mean matching and logistic regression.
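A minimal sketch of this imputation step with the mice package follows; the number of imputations, the seed, the use of a single completed data set, and the data frame name are illustrative assumptions rather than the authors’ settings.

```r
library(mice)

# Chained-equations imputation; by default mice uses predictive mean
# matching for numeric features and logistic regression for binary factors.
imp <- mice(brfss, m = 5,
            defaultMethod = c("pmm", "logreg", "polyreg", "polr"),
            seed = 2017)

# One completed data set carried forward to the machine learning analysis
# (an assumption about how the imputed data were used downstream).
brfss_imputed <- complete(imp, 1)
```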

The largest sources of missing data were missing responses for income (16.7%), HIV testing (12.2%), and daily vegetable intake (10.5%). To determine differences due to multiple imputation of missing data, we compared frequencies for each feature between the imputed and non-imputed data. The absolute percentage difference between imputed and non-imputed values ranged from 0.01% (e-cigarette use) to 6.9% (non-Hispanic White race/ethnicity).

Odds of Excellent or Very Good Health

To improve the interpretability of the models, we used logistic regression to estimate the odds of excellent or very good health compared with good, fair, or poor health. We used the top 20 features identified by the machine learning models to fit weighted logistic regression models for each population subgroup, accounting for the complex survey design in SAS via the SURVEYLOGISTIC procedure. To prevent bias from list-wise deletion, we imputed data in SAS via the MI and MIANALYZE procedures.11,12
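These models were fit in SAS; for readers working in R, the following is an analogous, hedged sketch using the survey package. The design variable names (psu, ststr, llcpwt), the data frame, and the listed predictors are placeholders standing in for the BRFSS design variables and the top-20 importance features.

```r
library(survey)

# Complex-survey design: primary sampling unit, stratum, and final raking
# weight (placeholder names; the raw BRFSS fields are _PSU, _STSTR, _LLCPWT).
dsn <- svydesign(ids = ~psu, strata = ~ststr, weights = ~llcpwt,
                 nest = TRUE, data = brfss_imputed)

# Design-weighted logistic regression on the top-20 importance features
# (only a few illustrative predictors are shown here).
fit <- svyglm(srh_excellent_vg ~ income + education + difficulty_walking +
                depression + bmi_category,
              design = dsn, family = quasibinomial())

# Weighted odds ratios (with 95% confidence intervals) of excellent or
# very good self-rated health.
exp(cbind(OR = coef(fit), confint(fit)))
```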

Comparison to 2016 Data

Survey questions for the BRFSS differ by year. We repeated our process in the 2016 BRFSS survey to understand how our results changed when a different set of covariates was available.

RESULTS

Sex, Race/Ethnicity, and Age Trends in Self-Rated Health and Other Descriptive Covariate Features

The weighted prevalence of self-rated health and other study covariates in the 2017 BRFSS is presented in Table 1. Of the 449,492 participants, nearly half rated their health as excellent or very good (49.0%). Most identified their race as non-Hispanic White (83.3%); a majority identified as female (55.8%). The percentage of the population who rated their health as excellent or very good decreased with age (60.2% 18–29 years old, 56.6% 30–39 years old, 52.0% 40–49 years old, 47.9% 50–59 years old, 46.4% 60–69 years old, and 42.0% 70 years and older, p value < 0.001).

The most notable differences in covariate features were seen by age group. The youngest population (18–29 years old) had the highest percentage of any physical activity (77.7%), normal BMI (49.9%), Internet use in the past 30 days (96.5%), and low income (29.2%), and the lowest percentage of arthritis (4.6%), diabetes (2.1%), hypertension (9.3%), difficulty walking (3.0%), difficulty doing errands (4.3%), and being married (20.7%). The population 70 years of age and older had the highest percentage of arthritis (52.7%), diabetes (22.4%), hypertension (62.4%), and difficulty walking (28.2%), and the lowest percentage of current smokers (7.6%), self-reported depression (14.1%), and Internet use in the past 30 days (60.9%).

Machine Learning Model Fit

We calculated AUC values for the three machine learning algorithms for the total population (Appendix Table 2). The AUC values were nearly identical for the total population for regularized logistic regression (0.81), random forest (0.80), and support vector machine (0.81). Model fit and importance rankings were similar by race/ethnicity (non-Hispanic Black AUC 0.78, Hispanic AUC 0.79, non-Hispanic White AUC 0.81) and by sex (female AUC 0.82, male AUC 0.79), but differed by age. We thus focused our data presentation on results related to age groups across the life course. For all three algorithms, AUCs were highest (0.79–0.83) among mid-life age groups (ages 40–49 and 50–59 years) and young elders (ages 60–69 years). Model fit was slightly lower in the youngest age category (18–29 years; AUCs 0.72–0.73). Because fit was similar across algorithms, we used regularized logistic regression to select importance features, given its ease of interpretation.

Top Features Predicting Self-Rated Health from Machine Learning Algorithms

The top importance features identified by the regularized logistic regression machine learning algorithm across all age groups are presented in Figure 1. The top two features predicting excellent or very good health in the youngest population (18–29 years old) were BMI and depression. Education and income were the top features predicting self-rated health in the populations aged 30–39 and 40–49 years. The top two features predicting self-rated health in the population aged 50–59 were difficulty walking and income; in the population aged 60–69, the top two features were difficulty walking and hypertension; and in those aged 70 and older, the top two features were difficulty walking and Internet use in the past 30 days.

Figure 1

Top variables of importance across age groups, 2017 BRFSS. Notes: The seven domains include demographics: age, sex, race, geographic division, state of residence, number of adults in the respondent’s household, marital status, veteran status, number of children, and language spoken; clinical conditions: a self-reported history of cancer, asthma, depression, diabetes, stroke, cardiovascular disease, kidney disease, arthritis, COPD, skin cancer, body mass index, angina, or hypertension; functional status: difficulty doing errands, difficulty dressing, difficulty walking, difficulty communicating, blindness, or deafness; access to clinical care: delayed care due to cost, having a primary care physician, insurance status, and having had a doctor visit in the previous year; health behavior: alcohol use, smoking status, e-cigarette use, use of chewing tobacco, exercise practices, drunk driving, seat belt use, Internet use in the last 30 days, daily fruit consumption, and daily vegetable consumption; preventive care: having had an HIV test, having identified HIV risk factors, and having had a flu vaccine; socioeconomic status: educational attainment, income category, homeownership, employment, and cell phone use.

We counted the number of features identified within each covariate domain and plotted the counts in Figure 2. Socioeconomic status features were important predictors of self-rated health across the life course and were identified most frequently by the regularized logistic regression machine learning model in mid-life (the 30–39, 40–49, and 50–59 years age groups). Health behaviors and health care access were identified as important predictors of self-rated health for the population aged 18–29 years. Comorbidities and functional status were important predictors of self-rated health for those aged 70 and older. Race/ethnicity was most frequently identified as a predictor of self-rated health in the younger age groups (18–29 and 30–39 years).

Figure 2

Top variables of importance by domain and age group, 2017 BRFSS. Notes: Data analysis performed with regularized logistic regression machine learning algorithm.

Odds of Excellent or Very Good Health in Frequentist Logistic Regression Models

We estimated the odds of excellent or very good self-rated health by age group, using the top 20 features important to predicting self-rated health as identified by the regularized logistic regression machine learning algorithm (Table 2). The frequentist logistic regression models, reduced to these 20 importance features, demonstrated good model fit (AUC values 0.72–0.82).

Table 2 2017 BRFSS Weighted Odds Ratios (95% CI) of Excellent or Very Good Self-Rated Health by Age Groups

Higher income and education increased the odds of excellent or very good self-rated health in all age groups (Table 2). Additionally, physical activity, self-reported depression, difficulty concentrating, and the presence of hypertension were associated with the odds of excellent or very good self-rated health in all age groups. Increasing BMI was associated with decreasing odds of excellent or very good self-rated health in all groups except the oldest BRFSS participants (70+). Non-Hispanic Black race was associated with lower odds of excellent or very good self-rated health in mid-life and older groups (ages 50–59, 60–69, and 70+). Sex was not identified as a predictive feature in any model. Full results for the weighted logistic regression models by age group are presented in Appendix Table 3.

Replication in 2016 Data

In the 2016 BRFSS data, patterns of model fit, importance ranking, and prediction in logistic regression were substantially similar to the 2017 data, except for the few features captured only in specific years. For instance, dental health factors (time since last dental visit and number of teeth extracted), which were surveyed in 2016 but not 2017, ranked as important predictors of self-rated health.

DISCUSSION

Machine learning models are increasingly used to gain population health insights, though explicitly training models to provide insights on factors that matter for health in diverse populations has not been commonly reported. In the BRFSS, we found that three common machine learning algorithms provided good model fit and identified similar features as contributors to excellent or very good self-rated health. Notably, we anticipated that racial and ethnic differences might have influenced how well algorithms predicted self-rated health, as well as the types of features that were important predictors of self-rated health in specific racial and ethnic groups. However, in these analyses, model fit and predictions were similar by race/ethnicity and by sex. Instead, we found that the factors predicting self-rated health differed across the life course: diseases of aging were clearly prominent in older groups, access to care and health behaviors contributed most to prediction among the young, and socioeconomic indicators predicted self-rated health in all groups, but particularly in mid-life. In regression models that took advantage of the data reduction and refinement suggested by the machine learning models, we were able to predict self-rated health with a moderately high degree of accuracy in most groups studied.

Our analysis of BRFSS data highlighted the importance of socioeconomic conditions in mid-life. The influence of socioeconomic status on health and well-being in mid-life has been previously recognized in the life course literature.13 The cumulative effects of social conditions may become apparent in mid-life, while access to Medicare and survivorship bias potentially explain the decline in the impact of socioeconomic status at older ages.14

Though the dimensions measured in the BRFSS are interrelated, visualizing the association of specific risks at different points across the life course reinforces the idea that a “precision health” approach is needed to influence self-rated health and well-being at specific stages of the life course, tailored and matched to population need.

We were able to identify the importance of socioeconomic factors to self-rated health because of their availability in public health data. Our findings underscore the importance of collecting socioeconomic data and asking patients about socioeconomic factors in clinical settings for use in electronic health records (EHRs), both to make this information available for algorithmic prediction in patient care settings and to plan life-stage-appropriate interventions that promote wellness for patients, including community partnerships that address social determinants of health.15

Limitations

A critical limitation of all machine learning models is that they capture associations that are not necessarily causal. For example, though we identified BMI and depression, alongside a series of other factors, as predictors of self-rated health in young adults, it is not clear that intervening on these factors alone or in combination would influence self-rated health, or whether there is reverse causation, in which those with poor self-rated health go on to develop higher BMI and depression. Thus, insights from machine learning models may be helpful for identifying relationships within data, but additional strategies, including implementation studies and other methods, are needed to develop actionable strategies for self-rated health interventions.

Second, our models were inherently limited by the features available in BRFSS data. We noted that dental health features predicted self-rated health in the 2016 data but were not available in the 2017 data. The limitations of data availability make it useful to employ a conceptual model to clearly outline the factors that are present or missing in models and to facilitate data interpretation. For this purpose, we chose the Healthy People 2020 framework to identify relevant dimensions for analysis. Using this framework, we recognize that many factors, such as neighborhood and built environment (segregation, area-level poverty, or deprivation), social and community context (social relationships and social capital, experiences of discrimination), detailed clinical data, and other factors not readily linked to the individual-level data surveyed in the BRFSS, may affect the conclusions drawn here. One strength of our findings is the high AUC values, suggesting good model fit with the data that are available in the BRFSS. Additionally, though a strength of the BRFSS is the availability of rich survey data, including social and health status features, the BRFSS population represents those who made themselves available for telephone interviews and may not fully represent the US population. Replicating these strategies in population health data may provide information that is relevant to clinical populations.

Importantly, using machine learning models as explicit tools to examine potential contributors to health inequities along the lines of race/ethnicity, sex, and other dimensions is a new area of research and a strength of our current analysis.16,17,18 By employing two key strategies, (1) using data that captured socioeconomic factors and (2) stratifying the training and testing of machine learning algorithms within prespecified groups, we were able to confirm that models performed similarly by race/ethnicity and sex. We were also able to identify life course patterns that deserve further study and potential intervention. Future studies should also examine more detailed data on sexual orientation and gender identity (SOGI), which are becoming increasingly available in public datasets.

CONCLUSIONS

Using machine learning models in population-based data, we identified 20 of 51 factors that are especially predictive of self-rated health in specific groups across the life course, with a moderately high degree of model fit. Though the findings in this study show similar predictions by race, ethnicity, and other demographic characteristics, the strategy we propose, confirming and validating the accuracy of predictions in diverse groups, is important for testing the assumption that models are relevant to subgroups. As machine learning models are increasingly used in clinical and population health settings, it is important for clinicians, researchers, and population health managers to become adept at using strategies that ensure equity in the application of these methods.19

The strategies used here may enhance equity in the use of machine learning models applied to EHR data, including (1) ensuring data sources capture relevant social, demographic, and contextual data on which to base predictions, (2) using conceptual models to provide transparency about which factors contribute to, or are left out of, predictions, (3) stratifying models within specified subgroups to monitor the accuracy of model predictions in subgroups, and (4) reporting data by subgroup to inform interventions that may be needed within specific groups. Future research should examine these strategies for use in clinical data to enhance equity in prediction for diverse populations.