Background

Approximately 20–25% of patients with untreated hepatitis C virus (HCV) infection will progress to cirrhosis within 30 years, though individual rates of progression vary depending on comorbidities and other risk factors [1]. Modern direct-acting antiviral (DAA) medications have outstanding efficacy and can eradicate HCV in nearly all cases [2]. Viral eradication clearly reduces progression to cirrhosis and lowers the risk of mortality [3,4,5,6]. However, the rate of hepatic fibrosis regression varies between individuals. In some, liver disease may continue progressing even after successful antiviral treatment, particularly given the emergence of non-alcoholic fatty liver disease [7].

Hepatic fibrosis and necroinflammation are powerful predictors of future disease progression [8]. Liver biopsy, historically the criterion standard for assessment of hepatic fibrosis, is invasive, costly, and associated with complications, making it impractical for routine monitoring of all patients with HCV [9]. Transient elastography, while promising, is not universally available and may not be obtainable in low-resource settings [10]. Many cross-sectional studies have tried to use laboratory data or other non-invasive methods to stage hepatic fibrosis in individuals with HCV at a single time point, yet few have aimed to predict future liver disease progression and none have incorporated dynamic antiviral treatment status into predictive models [11]. As a result, available cirrhosis prediction models have unknown generalizability to the expanding group who achieve sustained virologic response (SVR) after HCV antiviral therapy. No available laboratory-based models accurately predict the risk of progression to cirrhosis after SVR, with the result that life-threatening liver disease complications, such as hepatocellular carcinoma or esophageal varices, could develop and progress undetected after antiviral treatment.

The few cirrhosis predictive models that do exist have methodologic limitations and achieve only marginal discrimination in predicting fibrosis progression [12,13,14]. Most importantly, earlier models using traditional regression-based methods reduced laboratory data to a single value (e.g., baseline value, mean, maximum, or minimum) [12,13,14,15]. This approach obscures the trajectory of key laboratory data, which is often an important clue to the ongoing development of cirrhosis [14]. We sought to develop an improved cirrhosis prediction model using survival analysis, an ideal technique given the variable length of time until development of cirrhosis, while also incorporating the full spectrum of laboratory data and HCV treatment status.

Methods

Study population and data collection

We obtained data from the VHA Corporate Data Warehouse, a continually updated electronic repository of demographic, laboratory, pharmaceutical, and other clinical data for Veterans under VHA care. We identified all patients in the VHA system with a history of HCV, defined as the lifetime presence of at least one positive HCV RNA test from January 1, 2000, to January 1, 2016 (n = 280,494). We defined HCV treatment as the receipt of at least one dose of an antiviral medication approved by the US Food and Drug Administration for the treatment of HCV on or before December 31, 2015. SVR, assessed as of December 31, 2016, was defined as the permanent absence of detectable HCV RNA after antiviral treatment. Patients were followed for the development of cirrhosis or death through January 1, 2019. We required patients to have at least two AST-to-platelet ratio index (APRI) scores (n = 231,566). APRI is a widely used non-invasive method for assessing fibrosis stage among patients with HCV, with excellent accuracy in detecting advanced fibrosis and cirrhosis. We defined APRI using the standard formula APRI = 100 × (AST [U/L] / 40) / platelet count [10^3/µL], where 40 U/L is taken as the upper limit of normal for AST [16]. Component laboratory values were required to be drawn within 30 days of one another and could occur in inpatient or outpatient settings.
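
The APRI formula above can be illustrated with a minimal Python sketch. The function name `apri` and its parameter names are hypothetical and are not taken from the study code; the default upper limit of normal of 40 U/L follows the formula in the text.

```python
def apri(ast_u_per_l: float, platelets_k_per_ul: float, ast_uln: float = 40.0) -> float:
    """AST-to-platelet ratio index.

    APRI = 100 * (AST / upper limit of normal) / platelet count,
    with AST in U/L and platelets in 10^3/uL.
    Names and the error handling are illustrative assumptions.
    """
    if platelets_k_per_ul <= 0:
        raise ValueError("platelet count must be positive")
    return 100.0 * (ast_u_per_l / ast_uln) / platelets_k_per_ul


# Worked example: AST 80 U/L with platelets 200 x 10^3/uL
# gives 100 * (80 / 40) / 200 = 1.0
example_score = apri(80, 200)
```

A score from this function would only be meaningful when the AST and platelet values come from draws within 30 days of one another, as required in the text; that pairing step is not shown here.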

Cohort entry was defined as the date of the first APRI. Time-zero was defined as the time of entry into the cohort (Fig. 1). We excluded patients with known or suspected cirrhosis at baseline or a history of hepatocellular carcinoma, defined by relevant International Classification of Diseases (ICD) codes prior to or within 1 year after cohort entry (n = 18,650) (Additional file 1: Table S1) or baseline APRI > 2.0 (n = 30,144) [16]. Our final cohort contained 182,772 patients with both HCV and at least two APRI scores who were classified as non-cirrhotic at baseline (Fig. 2). The study was reviewed by the Institutional Review Board of the Ann Arbor VA Healthcare Systems and was granted a waiver of informed consent.

Fig. 1 Adapted time-varying covariates model design

Fig. 2 Cohort development

Predictor variables

Predictors of interest were selected a priori based on our prior work, biological plausibility, and expert clinician opinion [13,14,15]. Demographic variables included age at cohort entry, sex, race, and Hispanic ethnicity. SVR was modeled as a step function of time whereby the variable value remained 0 until antiviral treatment, at which point it became 1. Laboratory predictors included aspartate aminotransferase (AST), alanine aminotransferase (ALT), AST/ALT ratio, albumin, total bilirubin, creatinine, blood urea nitrogen, glucose, hemoglobin, platelet count, white blood cell count, sodium, potassium, and chloride. INR and total protein were excluded because of a high degree of missingness at baseline (50% and 17% missing, respectively). We used all available laboratory data points for each patient. We modeled each longitudinal laboratory predictor as a step function in which the value between two consecutive time points was set to the lab value measured at the earlier time point. If a lab was not measured at time-zero, we imputed the missing value with the median of all values of that variable measured across patients at time-zero. After time-zero, any lab value missing during the accrual window (2 or 4 years) was imputed with the most recent value measured before the missing one.
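
The step-function treatment of laboratory values described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study code: the function name `step_value`, the representation of measurements as `(time, value)` pairs, and the use of days since time-zero are all hypothetical choices.

```python
from statistics import median


def step_value(measurements, t, baseline_median):
    """Value of one lab at time t under a carry-forward step function.

    measurements: list of (time, value) pairs sorted by time, where time
    is days since time-zero (an assumed unit for this sketch).
    If nothing has been measured by time t, fall back to the cohort
    median of values measured at time-zero.
    """
    value = baseline_median
    for time, v in measurements:
        if time > t:
            break
        value = v  # carry the most recent measurement forward
    return value


# Hypothetical AST values measured at time-zero in three other patients
baseline_ast_median = median([32.0, 45.0, 60.0])

# A patient with no AST at time-zero and measurements on days 90 and 400
patient_ast = [(90, 50.0), (400, 72.0)]
at_day_0 = step_value(patient_ast, 0, baseline_ast_median)      # cohort median
at_day_200 = step_value(patient_ast, 200, baseline_ast_median)  # day-90 value
```

In this sketch the baseline median persists until the first real measurement, which is one reasonable reading of the imputation rule in the text.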

We considered including additional comorbidities such as alcohol use and diabetes in the model. However, in prior work we found that these additional characteristics did not significantly contribute to prediction of cirrhosis after accounting for longitudinal laboratory results (e.g., AST, glucose) already included in our models [15]. In the same earlier study, we systematically evaluated a variety of parameters for body mass index (BMI) (e.g., most recent, minimum, maximum) and found they ranked at or near the bottom of statistical importance relative to the laboratory variables. Therefore, in the current study we report these patient characteristics but limit our models to laboratory data to enhance reproducibility across systems. We avoided variables such as “alcohol use,” which may be documented inconsistently and depend on the accuracy of patient reporting as well as on the definitions and diagnostic criteria used [15].

Outcome variable

We defined our primary outcome, cirrhosis development, as two consecutive APRI scores > 2, as described in previous work by our group [15]. APRI has been previously validated against liver biopsy in patients with HCV and has outstanding discrimination based on the area under the receiver operating characteristic curve (AUROC), with performance similar to transient elastography for detecting cirrhosis [17]. Furthermore, APRI is less sensitive to the effects of age than other non-invasive markers of fibrosis, such as the FIB-4 index, and performs at least as well as FIB-4 in predicting cirrhosis after SVR [18, 19]. The observation period for each patient started on the date of the first recorded APRI and ended with the occurrence of cirrhosis or censoring due to death or loss to VHA follow-up.
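
The outcome rule of two consecutive APRI scores above a threshold can be sketched as below. The function name `first_cirrhosis_time` is hypothetical, and dating the event at the second (confirming) score is an assumption made for this illustration; the text does not state which of the two dates was used.

```python
def first_cirrhosis_time(apri_series, threshold=2.0):
    """Return the time of the first confirmed outcome, or None.

    apri_series: list of (time, apri) pairs sorted by time.
    The outcome is met when two consecutive scores both exceed the
    threshold. ASSUMPTION: the event is dated at the second score.
    """
    for (_, previous), (time, current) in zip(apri_series, apri_series[1:]):
        if previous > threshold and current > threshold:
            return time
    return None


# A single isolated score above 2 (day 100) does not count;
# the pair on days 300 and 400 does.
series = [(0, 0.5), (100, 2.5), (200, 1.0), (300, 2.1), (400, 2.2)]
event_day = first_cirrhosis_time(series)
```

Passing `threshold=1.0` would give the alternative outcome definition based on APRI > 1.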

Statistical analysis

Time-varying covariates Cox model: In a classical time-varying covariates Cox model, an outcome (or “event”) cannot be predicted in advance, because computing the survival probability at a future time would require knowing future covariate values. To predict future cirrhosis with a time-varying covariates Cox model, we redefined the notion of an event in survival analysis. Traditionally, an event is the occurrence of an outcome at the current time. We instead defined the event as the occurrence of cirrhosis K years (K = 1, 3, or 5) after the accrual time, where K is the length of the prediction window specified by the user. The hazard function in our model therefore characterizes the conditional probability that cirrhosis will develop after K years of additional follow-up, given no previous occurrence of cirrhosis (Fig. 1). Model parameters were estimated by maximizing the partial likelihood. We fit the time-varying covariates Cox model using the survival package in R version 3.6.1 (R Project for Statistical Computing).
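
One possible way to picture the redefined event is to shift each patient's end of follow-up back by K years and then build the usual (start, stop, event) rows of a time-varying Cox dataset. The sketch below shows that reading only; it is an assumption about the data layout, not the authors' implementation, and the function name `shifted_intervals` and its arguments are hypothetical.

```python
def shifted_intervals(change_times, event_time, censor_time, k):
    """Build (start, stop, event) rows with the event clock shifted back by k.

    change_times: sorted times (years since time-zero) at which any
        covariate changes, beginning with 0.
    event_time: time of cirrhosis in years, or None if never observed.
    censor_time: end of follow-up in years.
    k: length of the prediction window in years.

    ASSUMPTION: a patient whose shifted end point is not after time-zero
    contributes no rows, since no covariate history precedes it.
    """
    observed = event_time is not None
    end = (event_time if observed else censor_time) - k
    if end <= 0:
        return []
    cuts = [t for t in change_times if t < end] + [end]
    rows = []
    for start, stop in zip(cuts, cuts[1:]):
        is_last = stop == end
        rows.append((start, stop, int(is_last and observed)))
    return rows


# Cirrhosis at year 5 with a 3-year prediction window: covariates
# observed up to year 2 are paired with the shifted event.
rows = shifted_intervals([0, 0.5, 2.0, 4.5], event_time=5.0, censor_time=5.0, k=3)
```

Rows of this shape correspond to the counting-process layout accepted by standard Cox software, with covariate values attached to each interval.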

Model evaluation

To evaluate the discriminative performance of the model, we used the area under the receiver operating characteristic curve (AUROC) to compare the predicted probability of developing cirrhosis with each patient’s observed outcome (an AUROC of 1.0 represents perfect discrimination). We predicted cirrhosis development at 1, 3, and 5 years using laboratory data accrual windows of 2 and 4 years after the first APRI. For example, for a patient with 4 years of laboratory data and a 2-year accrual window, we used the first 2 years of data to predict cirrhosis in the subsequent years. We chose accrual windows of 2 and 4 years because a 2-year period approximates the assessment of a new patient's trajectory on entering the health system, whereas a 4-year period gives a more longitudinal view that captures long-term changes. Patients censored before 1, 3, or 5 years were removed from the corresponding analysis because their true outcomes were not available. We evaluated each prediction setting in terms of specificity, sensitivity, positive predictive value (PPV), and negative predictive value (NPV).
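
The evaluation measures named above can be computed as in the following minimal sketch. It uses the rank interpretation of AUROC (the probability that a randomly chosen patient who developed cirrhosis has a higher predicted risk than one who did not, with ties counted as one half). Function names are hypothetical, and the pairwise loop is written for clarity rather than speed.

```python
def auroc(scores, labels):
    """AUROC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def _ratio(numerator, denominator):
    # Undefined ratios (empty denominator) are reported as NaN.
    return numerator / denominator if denominator else float("nan")


def confusion_metrics(scores, labels, cutoff):
    """Sensitivity, specificity, PPV, and NPV when score >= cutoff is positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


# Toy predicted risks and observed outcomes (1 = developed cirrhosis)
toy_scores = [0.9, 0.8, 0.3, 0.2]
toy_labels = [1, 0, 1, 0]
toy_auc = auroc(toy_scores, toy_labels)
toy_metrics = confusion_metrics(toy_scores, toy_labels, cutoff=0.5)
```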

Training and testing cohorts

We created model training and testing datasets by randomly splitting the sample into 70% and 30% subsets. The random splitting process was performed 30 times to produce a more stable evaluation and to generate confidence intervals. Under each split, the time-varying covariates Cox model was fitted on the training set and evaluated on the testing set. The AUROC measures for all outcome windows and predictor windows were averaged over the 30 splits. We report the representative split of training and testing data whose AUROC was closest to the average AUROC over the 30 splits. The best cut-off was selected by choosing the point on the ROC curve closest to the point where both sensitivity and specificity equal one. Specifically, we chose the cut-off that minimizes (1 − sensitivity)^2 + (1 − specificity)^2.
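
The 70/30 split and the cut-off rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, every distinct predicted score is treated as a candidate cut-off, a score at or above the cut-off counts as a positive prediction, and ties in distance are broken in favor of the lowest cut-off.

```python
import random


def split_70_30(ids, seed):
    """Randomly split patient identifiers into 70% training and 30% testing."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = (len(shuffled) * 7) // 10
    return shuffled[:n_train], shuffled[n_train:]


def best_cutoff(scores, labels):
    """Cut-off minimizing (1 - sensitivity)^2 + (1 - specificity)^2."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        sensitivity = tp / n_pos
        specificity = tn / n_neg
        distance = (1 - sensitivity) ** 2 + (1 - specificity) ** 2
        if best is None or distance < best[0]:
            best = (distance, cutoff)
    return best[1]


# Thirty splits would use thirty different seeds; one is shown here.
train_ids, test_ids = split_70_30(range(10), seed=1)

# Toy scores that separate the two classes perfectly at 0.6
chosen = best_cutoff([0.1, 0.2, 0.6, 0.8], [0, 0, 1, 1])
```

Repeating the split with different seeds and averaging the resulting AUROCs would mirror the 30-split procedure in the text; model fitting itself is omitted from this sketch.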

Sensitivity analysis

For prediction of cirrhosis, we used an APRI cut-off > 2 as our primary outcome to maximize specificity and PPV; however, we also performed a sensitivity analysis using APRI > 1 given variation in thresholds across prior studies.

Results

Cohort characteristics and incidence of outcomes

Table 1 provides summary statistics for patient characteristics and baseline laboratory measurements among individuals with HCV infection and a minimum of 1 (n = 146,182), 3 (n = 110,559), and 5 (n = 84,189) years of follow-up time. Patients were 97% male and a slight majority (51%) were white, with a mean age of 52.4 (SD 8.32) years. Baseline APRI scores were low (mean 0.668 [0.686]), as expected for a non-cirrhotic population. Patients had a mean BMI of 27.4 (SD 5.24). More than half (50.9%) carried a diagnosis of alcohol use disorder and nearly a third (31.7%) carried a diagnosis of diabetes. The majority (52%) underwent antiviral treatment between 2000 and 2015 (not including the additional patients treated after 2015). Of the 95,630 who received treatment, 80.3% received a DAA and 30.5% received an interferon-based regimen (10.6% received both). The median time to HCV treatment (from first diagnosis of HCV infection to first treatment) was 6.91 years, and the aggregated SVR rate was 80.3%. A total of 16.2% (n = 29,566) developed cirrhosis, with a median of 4.98 years to cirrhosis development after time-zero.

Table 1 Characteristics of patients with a minimum of 1, 3, and 5 years of follow-up time

Model performance

We predicted cirrhosis development at 1, 3, and 5 years using laboratory covariate time windows of 2 and 4 years. The average, standard deviation, and 95% confidence intervals for AUROC over 30 random splits are summarized in Table 2. The misclassification results for all 6 combinations of outcome prediction windows and covariate time windows are shown in Table 3. To investigate the effect and significance of each predictor, we fit the 1-, 3-, and 5-year outcome prediction models on the full cohort of data. The summary of model fitting is shown in Additional file 1: Tables S2–S4. The p values in the summary tables reflect the significance of each variable’s longitudinal trajectory values in predicting cirrhosis. Cirrhosis predictors such as AST and platelets had extremely small p values (< 0.0001). SVR was highly significant in explaining cirrhosis outcomes.

Table 2 AUROC for cirrhosis prediction after 1, 3, and 5 years of follow-up using 2- and 4-year laboratory data windows
Table 3 Misclassification table

Sensitivity analysis

For the 1-year prediction model evaluated with a 2-year laboratory accrual window, the average AUROC was 0.815 (95% CI 0.813–0.817) with cirrhosis defined by APRI > 2 and 0.708 (95% CI 0.706–0.710) with cirrhosis defined by APRI > 1.

Discussion

A Cox model using time-varying covariates and a flexible time accrual window for longitudinal laboratory data achieved excellent discrimination for cirrhosis prediction at 1, 3, and 5 years among patients with HCV. Our study is the first to successfully use a large administrative dataset with a time-varying covariates model to predict future cirrhosis outcomes in HCV patients with and without SVR. This approach achieved high AUROCs for predicting the development of cirrhosis, as assessed by serial APRI scores, and performed well at up to five years compared to previous models that were limited by fixed laboratory covariates and shorter follow-up time [15].

We developed a novel approach to prediction by transforming longitudinal laboratory variables into time-varying covariates, allowing us to use each patient’s full spectrum of laboratory data instead of reducing the laboratory data to summary values. Unlike earlier models constructed exclusively for patients with viremic HCV, we included antiviral treatment as a time-varying covariate. Our model is therefore generalizable to both treated and viremic patients with HCV. All six combinations of laboratory data windows (2 or 4 years) and cirrhosis prediction windows (1, 3, or 5 years) produced excellent AUROCs. Taken together, these results indicate that our method accurately predicted risk of cirrhosis without inducing obvious bias due to the selection of the prediction window length.

Our study benefited from a very large HCV population drawn from the VHA healthcare network, which oversees the largest single cohort of patients with HCV in the US. We had access to comprehensive laboratory, demographic, and pharmacy data for all patients. VHA users tend to be older and more likely to be male than the general US population, so results should be extrapolated cautiously to other cohorts. Our conclusions are tempered by the use of a laboratory surrogate (two consecutive APRI scores > 2) to mark the development of cirrhosis rather than liver biopsy or transient elastography results, though prior studies have confirmed APRI as an excellent surrogate for biopsy-proven cirrhosis [17]. We selected this method because of the unknown validity of transient elastography values after HCV treatment and the small proportion of patients undergoing serial liver biopsy after antiviral therapy. In addition, we sought a surrogate cirrhosis endpoint that would be practical for others to replicate in administrative datasets and in resource-limited settings. Nevertheless, although APRI is considered a reliable laboratory marker of cirrhosis, a small amount of cirrhosis misclassification likely occurred. As a linear model, the time-varying covariates Cox model can only reflect a linear effect between the predictors and the outcome and therefore may not fully represent a non-linear relationship. We note that approximately 30% of the treated patients in our cohort received an interferon-based regimen because of the time period involved. Though such regimens are obsolete, there is no scientific reason to suspect that the type of regimen used would alter the risk of subsequent cirrhosis development after SVR or change the conclusions of the study. Finally, our data sources lacked results for laboratory testing or antiviral treatment conducted outside the VHA system. This model may not generalize to non-Veteran populations, and future external validation studies are needed to assess its performance.

Conclusions

Our model has many potential applications for predicting cirrhosis given the expanding population of patients with HCV now achieving SVR after antiviral treatment. For example, as more HCV patients achieve SVR, practitioners will need tools to identify those at continued risk for cirrhosis despite antiviral therapy. Incorporating predictive models into HCV registries or other population-based systems may serve to identify patients who require continued specialty care and disease monitoring after HCV eradication. Furthermore, health care systems could use cirrhosis prediction tools to estimate and prepare for the future burden of disease among persons with HCV, with and without treatment. Our novel time-varying covariates Cox model provides an accurate method for predicting cirrhosis that improves upon earlier models and can be applied at scale in large administrative datasets using widely available laboratory markers.