INTRODUCTION

Non-alcoholic fatty liver disease (NAFLD) is the most common chronic liver disease in westernized societies,1 with a global prevalence of 25%.2 NAFLD represents a spectrum of disease from fat accumulation in the liver, to inflammation and progressive fibrosis, and eventual progression to cirrhosis and hepatocellular carcinoma.3,4,5 In addition, recent evidence shows that NAFLD complications are not confined to advanced liver disease but may also contribute to major extrahepatic conditions.6,7 These include a 2-fold increase in the risk of incident type 2 diabetes.8,9 Furthermore, cardiovascular disease (CVD) and extrahepatic malignancies account for a greater proportion of mortality than liver disease does.10

To diagnose NAFLD, there must be evidence of hepatic steatosis by imaging or histology, and absence of secondary causes.4 Liver biopsy, considered the gold standard for NAFLD diagnosis, is costly and unnecessarily invasive, and is not feasible in the general population.4 Furthermore, in view of the high prevalence of overweight and the metabolic syndrome, assessing all patients at risk for NAFLD using imaging is likewise not feasible.11 A simplified algorithm to screen patients at high NAFLD risk is therefore desirable. In clinical settings, clinicians could use it to prioritize who should receive an imaging study. Likewise, in research settings, investigators could use it to identify high-risk study participants.

Previous risk scores have been developed for detecting hepatic steatosis12,13,14,15,16,17 and some have been externally validated.18,19,20,21,22 However, all were developed in racially homogeneous populations, so it is unclear how they would perform in a heterogeneous population in the USA. Furthermore, all of these risk scores included clinical laboratory markers, which are not always readily available. The aim of this study was therefore to develop, in a large multi-ethnic cohort in the USA, a practical scoring tool for predicting NAFLD risk based on participant demographics, medical history, anthropometrics, and routine laboratory values, referred to as the NAFLD-MESA Index. Because laboratory tests are sometimes not readily available or feasible to measure, a secondary aim was to develop a second index without laboratory variables, the NAFLD-Clinical Index. Lastly, we compared the performance of our two models against the fatty liver index (FLI),12 which we additionally validated in our sample, to quantify any observed difference in classification performance.

METHODS

Data Source

The Multi-Ethnic Study of Atherosclerosis (MESA)23 is a well-characterized cohort of 6814 participants aged 45–85 and free of known CVD. The study was established in 2000, and participants were recruited from six US communities (Columbia University, New York; Johns Hopkins University, Baltimore; Northwestern University, Chicago; UCLA, Los Angeles; University of Minnesota, Twin Cities; and Wake Forest University, Winston-Salem). The racial/ethnic distribution was as follows: 38% white, 28% African American, 22% Hispanic, and 12% Chinese American. Informed consent was obtained from all study participants, and institutional review board (IRB) approval was obtained by the MESA sites. Ethics approval for the use of anonymized data was obtained from the UCSF IRB on 2 January 2018 (16-21085).

Sample Population

We excluded participants whose computed tomography (CT) imaging did not extend far enough inferiorly to measure liver fat attenuation (n = 78), as well as participants with high alcohol use (average > 1 serving/day in women and > 2 servings/day in men) (n = 343), liver cirrhosis (n = 9), or use of oral steroids or class 3 antiarrhythmic medications (n = 103).24 Our final sample comprised 6194 participants from the baseline visit between 2000 and 2002.

Outcome Measure: NAFLD

At the baseline visit, participants received two consecutive CT scans, which included liver images.24,25 Liver attenuation by CT scan has been shown to be inversely correlated with liver fat deposition by liver biopsy (correlation coefficient: − 0.9; p value < 0.001), showing that CT scanning provides a useful non-invasive method for identifying moderate to severe fatty liver.26 We used a previously validated threshold of ≤ 40 Hounsfield units (HU) to classify participants as having moderate to severe hepatic steatosis (> 30% liver fat) or not.27,28

Potential Predictors

Fourteen candidate predictors were identified a priori based on their known association with NAFLD29,30 or components of the metabolic syndrome31 and their availability. These included the following: body mass index (BMI), waist circumference (WC), waist-to-hip ratio, age, sex, race/ethnicity, education, smoking history, recent weight change, gamma-glutamyltransferase (GGT), triglycerides, type 2 diabetes, high-density lipoprotein (HDL)-cholesterol, and hypertension.

Predictor Measurements

Anthropometric measures were taken using standardized procedures.32 BMI was categorized according to established criteria33,34,35: normal weight (< 25 kg/m2), overweight (25–< 30 kg/m2), and obesity (30–< 35 kg/m2 grade 1, ≥ 35 kg/m2 grade 2) for white, African American, and Hispanic participants, and normal weight (< 23 kg/m2), overweight (23–< 27.5 kg/m2), and obesity (≥ 27.5 kg/m2) for Chinese Americans. WC was measured and categorized into three groups36: < 88, 88–102, > 102 cm. Age was categorized into decade groups and further modified into three categories to maximize discrimination ability: 45–< 65, 65–< 75, and 75–85. Sex was self-reported. Highest achieved education was categorized into the following: less than high school, high school, some college, and bachelor’s degree or higher. Race/ethnicity was self-reported. Smoking history was categorized as never, former, or current. Recent weight change was calculated by comparing measured weight at study baseline against self-reported highest weight over the prior 3 years and expressed as a percentage of weight loss/gain. GGT was categorized into quartiles according to units per liter (< 5, 5–< 8, 8–< 14, and ≥ 14). Triglycerides were measured in the fasted state and initially categorized according to established criteria,36 but then modified into the following three categories to improve discrimination: < 75, 75–< 150, ≥ 150 mg/dL. Type 2 diabetes was defined as fasting glucose ≥ 126 mg/dL and/or use of any diabetes treatment. HDL cholesterol was classified using the ATP III cut-offs of < 40 and ≥ 60 mg/dL,36 with intermediate categories of 40–49 and 50–59 mg/dL. Resting blood pressure was measured in the seated position. Missing data were observed in less than 2% of these predictor variables.
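The race/ethnicity-specific BMI categories described above can be illustrated with a short Python sketch. The cut-offs are those quoted in the text; the function name and the race/ethnicity label strings are hypothetical and chosen only for this illustration, not taken from the MESA data dictionary.

```python
# Hypothetical label for the group that uses the lower BMI cut-offs.
LOWER_CUTOFF_GROUPS = {"chinese_american"}


def bmi_category(bmi, race_ethnicity):
    """Return the BMI category (kg/m2) using the cut-offs quoted in the text."""
    if race_ethnicity in LOWER_CUTOFF_GROUPS:
        # Chinese American participants: < 23, 23-< 27.5, >= 27.5
        if bmi < 23:
            return "normal weight"
        if bmi < 27.5:
            return "overweight"
        return "obesity"
    # White, African American, and Hispanic participants:
    # < 25, 25-< 30, 30-< 35 (grade 1), >= 35 (grade 2)
    if bmi < 25:
        return "normal weight"
    if bmi < 30:
        return "overweight"
    if bmi < 35:
        return "obesity grade 1"
    return "obesity grade 2"
```

Under these cut-offs, a BMI of 28 kg/m2 is classified as overweight for a white participant but as obesity for a Chinese American participant.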

Statistical Analysis of Characteristics

Baseline characteristics, anthropometric data, and clinical parameters are reported as means and SD or median and interquartile range depending on their distribution, or as counts and proportions.

Risk Score Derivation

To select the optimal subset of predictor variables that minimizes error in NAFLD prediction, we used a conditional random forest classification algorithm that accounts for variable correlation in the importance calculation. Estimation was based on the R party package37 using the full sample. Random forest classification is a nonparametric, ensemble classification tree method that incorporates bootstrap aggregation in the assessment of variable importance.38 From the original 14 variables, the random forest identified nine predictor variables that were most influential in minimizing prediction error. WC was identified as an important variable but was subsequently removed from the final set of predictor variables because it is not regularly or accurately measured in routine clinical settings; furthermore, including it did not significantly improve model performance. This left eight variables for the final model.

To develop and validate our final model, called the NAFLD-MESA index, we selected a random 2/3 of the sample (n = 4131) for model training and 1/3 (n = 2063) for model validation. A risk score for the final multivariable model was derived using a modified version of the Framingham Heart Study approach.39 Briefly, a logistic regression model was fitted to the NAFLD outcome using the eight predictors. Model coefficients were then converted to points, with 1 point indicating the risk equivalent to the smallest coefficient (type 2 diabetes). A total risk score was then calculated for each participant by adding all points from the eight variables. A detailed algorithm describing risk score point derivation is included in the Appendix in the Supplementary Information. We assessed the presence of two-way multiplicative interactions in separate models using likelihood ratio tests, including between race and BMI, sex and BMI, sex and age category, and sex and smoking. None of these was statistically significant at the 5% level, so the final model included only main effects.
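As a rough illustration of the coefficient-to-points conversion, the Python sketch below divides each logistic regression coefficient by the smallest coefficient and rounds to the nearest integer, then sums the points for the categories a participant falls into. This is a simplified reading of the approach, not the detailed algorithm from the Appendix, and the coefficient values and category names are hypothetical placeholders rather than the estimates in Table 2.

```python
def coefficients_to_points(coefficients):
    """Scale coefficients so the smallest non-zero one is worth 1 point."""
    base = min(abs(b) for b in coefficients.values() if b != 0)
    return {name: round(b / base) for name, b in coefficients.items()}


def total_score(points, present_categories):
    """Sum the points of every category that applies to a participant."""
    return sum(points[name] for name in present_categories)


# Hypothetical coefficients (log odds ratios), for illustration only.
example_coefficients = {
    "type2_diabetes": 0.25,      # smallest coefficient, defines 1 point
    "bmi_overweight": 1.0,
    "bmi_obesity_grade1": 1.6,
    "tg_150_or_more": 1.3,
}

example_points = coefficients_to_points(example_coefficients)
example_total = total_score(
    example_points, ["type2_diabetes", "bmi_obesity_grade1", "tg_150_or_more"]
)
```

Rounding to integers trades a small amount of accuracy for a score that can be added up by hand, which is the stated motivation for a point-based index.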

We used a similar approach to construct the second model, which excluded the laboratory variables (GGT and triglycerides [TG]). The smallest coefficient in this case was that for being a former smoker. The AUCs of the two models were compared using a chi-squared test; we present Bonferroni-adjusted p values given that multiple pairwise differences were tested, including in models stratified by race/ethnicity.

Internal Discrimination and Calibration

To assess discrimination ability, we constructed ROC curves and calculated sensitivity, specificity, interval likelihood ratios, and estimated post-test probability of NAFLD at various score intervals. The intervals were selected by visually inspecting the ROC curves to identify slope changes. Interval likelihood ratios were obtained by dividing the proportion of participants with NAFLD whose score fell in a given interval by the proportion of participants without NAFLD whose score fell in that interval. Calibration performance was assessed on the validation sample using Hosmer-Lemeshow goodness-of-fit measures40 and Brier scores41 and graphically using a calibration plot, grouping participants into quintiles of NAFLD risk and plotting the average predicted risk of each quintile against the average observed risk.
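The interval likelihood ratio and the post-test probability derived from it can be sketched in Python as follows. The post-test probability is obtained by converting the pre-test probability to odds, multiplying by the likelihood ratio, and converting back, which is the standard Bayesian update; the function names and the counts in the example are hypothetical and do not come from Table 3.

```python
def interval_likelihood_ratio(cases_in_interval, cases_total,
                              noncases_in_interval, noncases_total):
    """Proportion of cases in the interval over proportion of non-cases in it."""
    prop_cases = cases_in_interval / cases_total
    prop_noncases = noncases_in_interval / noncases_total
    return prop_cases / prop_noncases


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Update a pre-test probability with a likelihood ratio via odds."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Hypothetical counts: 40 of 100 cases and 200 of 1000 non-cases
# have a score in the interval of interest.
example_lr = interval_likelihood_ratio(40, 100, 200, 1000)
# Hypothetical pre-test probability (prevalence) of 20%.
example_post = post_test_probability(0.20, example_lr)
```

An interval likelihood ratio above 1 raises the post-test probability above the prevalence, and a ratio below 1 lowers it.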

Model Performance Compared to the Fatty Liver Index

We compared the performance of our NAFLD-MESA and NAFLD-Clinical Index models against the FLI. The FLI combines BMI (kg/m2), WC (cm), and log-transformed serum TG (mg/dL) and GGT (U/L)12 in a logistic model to obtain a score between 0 and 100. We compared the AUC of each of our models against that of the FLI using a chi-squared test. In a sensitivity analysis, for a fairer comparison, we modified the FLI predictors (e.g., made them categorical) to potentially better fit our data and improve its discrimination performance. Analyses were conducted using R version 3.6.1 (Vienna, Austria) and Stata 15.0 (College Station, TX, USA).
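For readers unfamiliar with the FLI, the Python sketch below computes it from its four inputs. The regression coefficients are those reported in the original FLI publication by Bedogni et al.;12 they are not stated in this article, so they should be checked against that source before any use. The function and argument names are hypothetical.

```python
import math


def fatty_liver_index(tg_mg_dl, bmi_kg_m2, ggt_u_l, waist_cm):
    """Fatty liver index on a 0-100 scale (coefficients per Bedogni et al.)."""
    linear_predictor = (
        0.953 * math.log(tg_mg_dl)     # natural log of triglycerides
        + 0.139 * bmi_kg_m2
        + 0.718 * math.log(ggt_u_l)    # natural log of GGT
        + 0.053 * waist_cm
        - 15.745
    )
    # Logistic transformation, rescaled from 0-1 to 0-100.
    return 100.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical participant, for illustration only.
example_fli = fatty_liver_index(tg_mg_dl=150, bmi_kg_m2=30, ggt_u_l=30, waist_cm=100)
```

Because the FLI uses continuous inputs and a regression equation, it cannot be added up by hand in the way a point-based index can.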

RESULTS

Participant Characteristics

A total of 6194 participants were included in the study. Participants included in the derivation and validation sets had similar distributions of important covariates (Table 1). Participants with NAFLD were younger and had more components of the metabolic syndrome including a higher BMI, WC, TG, systolic blood pressure, and GGT. In addition, participants with NAFLD were more likely to be Hispanic, have a lower educational background, have type 2 diabetes, and be never smokers.

Table 1 Characteristics of Study Participants With and Without NAFLD in the Development and Validation Samples

NAFLD Predictors

The final logistic regression model for the NAFLD-MESA index included BMI, GGT, TG, sex, smoking, age, type 2 diabetes, and race/ethnicity. Our second model, the NAFLD-Clinical Index, excluded GGT and TG (Table 2). When coefficients were converted to risk score points, high levels of TG or BMI had the greatest risk contribution in the NAFLD-MESA index. In the NAFLD-Clinical Index, high BMI and younger age category had the greatest risk contributions.

Table 2 NAFLD-MESA Index and NAFLD-Clinical Index Predictors (Derivation Set, n = 4131)

Discrimination

ROC curves were constructed using the point-based system, and the AUC for NAFLD was estimated. For the full NAFLD-MESA model, the AUC was 0.83 (95% CI, 0.81 to 0.86) in the derivation set and 0.80 (95% CI, 0.77 to 0.84) in the validation set (Fig. 1). The NAFLD-Clinical Index model performed marginally lower, with an AUC of 0.78 (95% CI, 0.75 to 0.81) in the derivation set and 0.76 (95% CI, 0.72 to 0.80) in the validation set (Bonferroni-adjusted p < 0.01) (Fig. 2).
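The AUC of a point-based score can be read as the probability that a randomly chosen participant with NAFLD has a higher score than a randomly chosen participant without NAFLD, with ties counted as one half. The Python sketch below computes it directly from that definition; it is a minimal illustration rather than the routine used in the analysis, and the scores in the example are hypothetical.

```python
def auc_from_scores(scores_with_nafld, scores_without_nafld):
    """AUC as the proportion of case/non-case pairs ranked correctly."""
    concordant = 0.0
    for case_score in scores_with_nafld:
        for noncase_score in scores_without_nafld:
            if case_score > noncase_score:
                concordant += 1.0
            elif case_score == noncase_score:
                concordant += 0.5  # ties count as half
    return concordant / (len(scores_with_nafld) * len(scores_without_nafld))


# Hypothetical point totals, for illustration only.
example_auc = auc_from_scores([24, 26, 20], [18, 20, 22, 16])
```

Integer point scores produce many ties, which is why the half-credit rule matters more here than for a continuous predictor.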

Figure 1

Area under the receiver operating characteristic curve using the point system in the derivation (n = 4131) and validation (n = 2063) sets for model 1: NAFLD-MESA. AUC, area under the curve.

Figure 2

Area under the receiver operating characteristic curve using the point system in the derivation (n = 4131) and validation (n = 2063) sets for model 2: NAFLD-Clinical Index. AUC, area under the curve.

We report the interval likelihood ratio and post-test probability at each two-unit interval for both models (Table 3). We considered a post-test probability of NAFLD greater than the average pre-test probability (prevalence) to indicate a suitable cut-off for higher suspicion of NAFLD. In the NAFLD-MESA index, this corresponded to a binary cut-off of ≥ 22 points, which had a sensitivity of 75%, a specificity of 72%, and a post-test probability of > 8%. Similarly, in our NAFLD-Clinical Index, the corresponding binary cut-off was ≥ 20 points, which had a sensitivity of 80%, a specificity of 60%, and a post-test probability of > 8%.
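Sensitivity and specificity at a binary cut-off follow directly from the point totals, as in the Python sketch below. A score at or above the cut-off is treated as a positive screen, matching the "≥" convention in the text; the function name and the eight example participants are hypothetical.

```python
def sensitivity_specificity(scores, has_nafld, cutoff):
    """Sensitivity and specificity when score >= cutoff is a positive screen."""
    true_pos = false_neg = true_neg = false_pos = 0
    for score, diseased in zip(scores, has_nafld):
        screen_positive = score >= cutoff
        if diseased and screen_positive:
            true_pos += 1
        elif diseased:
            false_neg += 1
        elif screen_positive:
            false_pos += 1
        else:
            true_neg += 1
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical scores and outcomes, for illustration only.
example_scores = [24, 22, 18, 26, 20, 16, 23, 14]
example_outcomes = [True, True, True, True, False, False, False, False]
example_sens, example_spec = sensitivity_specificity(example_scores, example_outcomes, 22)
```

Lowering the cut-off raises sensitivity at the cost of specificity, which is the trade-off visible between the two indices' chosen cut-offs.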

Table 3 NAFLD-MESA Index and NAFLD-Clinical Index Interval Table

Internal Calibration

For the NAFLD-MESA model, the Hosmer-Lemeshow goodness-of-fit test yielded p = 0.24 and the Brier score was 0.053 in the validation set, indicating acceptable calibration and prediction performance. For the NAFLD-Clinical Index model, the Hosmer-Lemeshow goodness-of-fit test yielded p = 0.39 and the Brier score was 0.05. Graphically, both the NAFLD-MESA and NAFLD-Clinical Index models slightly overestimated risk overall, but the estimates by quintile were close to the line of equality (Appendix Figures 1 and 2 in the Supplementary Information).
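The Brier score and the quintile-based calibration comparison can be sketched in Python as follows. The Brier score is the mean squared difference between predicted probability and observed outcome (0 or 1), so lower values are better. The grouping function is a simplified stand-in for the calibration plot: it sorts participants by predicted risk, splits them into equal-sized groups, and returns the mean predicted and mean observed risk per group. All names and example values are hypothetical.

```python
def brier_score(predicted, observed):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


def calibration_groups(predicted, observed, n_groups=5):
    """Mean predicted vs. mean observed risk within groups of predicted risk."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)
    groups = []
    for g in range(n_groups):
        members = order[g * n // n_groups:(g + 1) * n // n_groups]
        if not members:
            continue  # fewer participants than groups
        mean_predicted = sum(predicted[i] for i in members) / len(members)
        mean_observed = sum(observed[i] for i in members) / len(members)
        groups.append((mean_predicted, mean_observed))
    return groups


# Hypothetical predictions and outcomes, for illustration only.
example_pred = [0.02, 0.03, 0.05, 0.06, 0.08, 0.10, 0.15, 0.20, 0.30, 0.40]
example_obs = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
example_brier = brier_score(example_pred, example_obs)
example_groups = calibration_groups(example_pred, example_obs)
```

Points lying below the line of equality in such a plot (mean predicted above mean observed) correspond to overestimated risk.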

Comparison with the Fatty Liver Index

When applied to our full cohort (n = 6194), our NAFLD-MESA index outperformed the FLI (AUC 0.83 [95% CI: 0.81, 0.85] vs. 0.78 [0.76, 0.80]; Bonferroni-adjusted p < 0.01). In contrast, our NAFLD-Clinical Index model performed similarly to the FLI (AUC 0.78 [0.75, 0.80]; Bonferroni-adjusted p = 1.00) (Table 4). In race/ethnicity-stratified analyses, our NAFLD-MESA index also performed better than the FLI among African Americans (AUC 0.83 [0.78, 0.88] vs. 0.79 [0.73, 0.84]; Bonferroni-adjusted p = 0.01) and Hispanics (AUC 0.79 [0.76, 0.83] vs. 0.74 [0.70, 0.78]; Bonferroni-adjusted p < 0.01), whereas performance was similar in whites and Chinese Americans (Appendix Table 1 in the Supplementary Information). In the sensitivity analysis, modifying the FLI predictors to improve its performance in our data resulted in minimal AUC changes of less than 1% (data not shown).

Table 4 Comparing the AUC of the NAFLD-MESA and NAFLD-Clinical Index Models Using the Point System to the AUC of the Fatty Liver Index Using the Regression Equation According to Bedogni et al.

DISCUSSION

In this large population-based cross-sectional study of white, African American, Hispanic, and Chinese American adults over the age of 45 years, we developed two practical indices that use a point-based system to discriminate between individuals with and without NAFLD. The NAFLD-MESA index showed adequate discrimination (validation AUC = 0.80), supporting its use in clinical settings to prioritize who should be referred for an imaging study for NAFLD diagnosis. Likewise, in research settings, researchers can use the index to identify individuals at high or low NAFLD risk. We also developed a NAFLD-Clinical Index that excludes biomarkers (GGT and TG) and found that it performed only marginally lower than the full NAFLD-MESA index, suggesting that it is appropriate when laboratory tests are not readily available.

Machine learning can identify highly predictive variables that might otherwise go unexplored with traditional methods such as stepwise logistic regression.42 The algorithm identified variables similar to those included in prior risk models as well as additional variables not previously included (sex, age, and smoking). Equally important, models should be interpretable and easy to implement in practice. For instance, the performance of our final simplified models was similar to that of regression models that allowed for nonlinearities of continuous variables without categorization, and to that of analogous models based on the random forest (data not shown). Consistent with prior studies,29,30 BMI, GGT, TG, type 2 diabetes, and race/ethnicity were independent NAFLD predictors. Age and sex have also been associated with NAFLD, but their association varies across the life-course. NAFLD prevalence increases with age until about 50 years, particularly among men.43,44 In populations younger than 50 years, men generally have a higher NAFLD risk than women, whereas among post-menopausal women, the risk of NAFLD has been found to be similar to that of men their age.43,44 NAFLD prevalence decreases after about the age of 50 in men and around the age of 70 in women.44 These findings are consistent with MESA, in which adults > 75 years had the lowest risk of NAFLD. In our model, women had a slightly higher risk than men. On the other hand, the association between cigarette smoking history and NAFLD is less clear. In MESA, current smokers had the lowest NAFLD prevalence, and this was consistent across different strata of BMI (data not shown). As we did not control for other ectopic fat stores in our models, residual confounding may explain at least part of this inverse association.

Our models share predictors with a number of NAFLD indices previously developed in homogeneous populations outside of the USA.12,13,14,15,16,17 Because some variables were unavailable, we were only able to validate the FLI12 in our sample. We found that our NAFLD-MESA index outperformed the FLI by 0.05 in AUC (0.83 vs. 0.78), and our NAFLD-Clinical Index performed comparably to the FLI. Notably, the FLI was developed in an Italian population12 and has been validated in East Asian populations,18,19,20,21,22 but, to our knowledge, this is the first time that a NAFLD risk score has been developed in an ethnically diverse sample. It is worth noting that the NAFLD-MESA index performed better than the FLI among African American and Hispanic adults, but no better than the FLI among whites and Chinese Americans, highlighting the importance of including race/ethnicity as a risk factor for NAFLD.

Our study has limitations. First, NAFLD diagnosis in our study was based on CT scans which are insensitive to mild hepatic steatosis.27,28 This resulted in outcome misclassification of mild NAFLD cases. Furthermore, steatosis tends to be reduced with progressing fibrosis, also leading to misclassification. Second, we could not evaluate non-alcoholic steatohepatitis due to the lack of histologic data. Third, MESA did not measure other liver enzymes, potentially leading to a lower model performance. Fourth, we were unable to compare our model performance against other NAFLD indices as we lacked the necessary variables. Lastly, we were unable to externally validate our index in contemporary clinical populations.

The 2016 guidelines of the European Associations for the Study of the Liver, Diabetes, and Obesity recommend screening high-risk individuals (e.g., with obesity) for NAFLD by liver enzymes and/or ultrasound as part of routine work-up.45 Likewise, the American Diabetes Association recommends routine screening for non-alcoholic steatohepatitis and liver fibrosis in patients with type 2 diabetes and fatty liver on ultrasound.46 Due to the high prevalence of obesity and the metabolic syndrome, routine screening for NAFLD would likely overwhelm imaging services. Importantly, the sensitivity for NAFLD using a BMI > 30 kg/m2 is likely too low, especially for those of Asian origin, who have a lower BMI distribution. The NAFLD-MESA index addresses this limitation, making it easier to identify high-risk individuals and reduce the proportion of patients referred to imaging studies. For instance, by applying the NAFLD-MESA index cut-off to MESA, only about 1/3 of the population would be referred to imaging, compared to about 75% of individuals with high BMI and/or type 2 diabetes. Nevertheless, we agree with prior authors47 that further research should evaluate whether targeted NAFLD screening using a tool such as this one is cost-effective.

CONCLUSION

In conclusion, the NAFLD-MESA and NAFLD-Clinical indices adequately discriminate between individuals with and without moderate to severe NAFLD and perform better than or similarly to the previously validated FLI. These indices can aid clinical decision making by risk-stratifying patients and referring those at high risk for imaging studies. Likewise, in research settings, these indices may aid in identifying high-risk individuals.