Background

Liver cancer is the seventh most common malignant tumor and the second leading cause of death among all cancers worldwide [1]. It is one of the five leading causes of death in the Chinese population, among which hepatocellular carcinoma (HCC) accounts for more than 85–90%. The development of HCC is closely related to chronic liver injury of any etiology. Early HCC detection is of paramount importance for improving prognosis due to the lack of a specific clinical presentation [2]. Combined testing of tumor markers can increase the sensitivity without reducing the specificity of the diagnosis [3]. Scientific risk-based stratification is a key means of improving the overall survival rate [4].

Increasing numbers of models have been derived to predict the risk of HCC, which can then be used to stratify patients. Some models require unique indicators (e.g., genetic testing or DNA level) that make data collection difficult or are limited to a certain group of people, such as those with viral infections, those taking antivirals, and those with a specific alpha-fetoprotein (AFP) level, substantially limiting the external verification and universality of such models [5,6,7]. An easy-to-use, international prediction model is urgently needed to guide the personalized management of populations at risk for HCC.

Recently, a simple multicenter collaborative and verified model for the prediction of HCC in hepatitis B virus (HBV)-infected patients was reported and had promising discrimination ability. The ASAP model includes four factors: age, sex, AFP level, and prothrombin induced by vitamin K absence-II (PIVKA-II) level. The online calculator can predict the probability of target patients developing HCC and classify them into three risk groups. The accompanying surveillance decisions include the recommendation that low-risk individuals do not need to undergo imaging at 6-month intervals [8]. Moreover, a retrospective case-control study reported that the best model containing the same four factors improved the accuracy of the detection of early-stage HCC in patients with viral or nonviral chronic liver disease predominantly in populations of Caucasian and African American descent [9]. However, this model has not been validated in real-world clinical practice. When prediction models are validated in new circumstances, due to differences in the case mix and model predictor effects, miscalibration is common, leading to reduced utility [10, 11]. Therefore, comprehensive model validation, including an assessment of the discrimination ability, calibration, and clinical application, is indispensable.

This study was performed to validate and update the ASAP model in patients with viral or nonviral chronic hepatitis and cirrhosis in daily clinical practice and explore the appropriate thresholds for the stratification of risk groups that can be used to inform the selection of health management strategies.

Methods

Data collection

This retrospective study was performed in an academic tertiary hospital, Hubei Provincial Hospital of Traditional Chinese Medicine in Central China, from May 2018 to January 2021. Data were extracted from the electronic medical records. Records from consecutive inpatients with both AFP and PIVKA-II results were reviewed. Data from patients older than 35 years with hepatitis, cirrhosis, hepatic benign space-occupying lesions (SOLs), or HCC before October 2019 were included.

The levels of AFP and PIVKA-II were measured with the appropriate testing systems from Roche Diagnostics (Shanghai, China) and Fujirebio Diagnostics (Fujirebio, Japan), respectively. Because the levels of AFP and PIVKA-II can rapidly increase with the onset of HCC, we defined three weeks as the longest time from the marker measurements to HCC diagnosis [12,13,14] and selected the first detected values to validate and update the ASAP model. Individuals with undiagnosed HCC beyond that interval and those without HCC were considered at-risk individuals in the subsequent follow-up analysis. The outcome was the definitive diagnosis of HCC. The follow-up duration was calculated from the time of the first measurement to the date of the conclusive HCC diagnosis or January 31, 2021. Data were censored at the time of loss to follow-up, the time of non-HCC-related death, and the end of the study.

The diagnosis of HCC was refuted or confirmed based on a comprehensive reference standard, and both a clinical diagnosis and pathological diagnosis were made based on spiral computed tomography, magnetic resonance imaging, and liver biopsy [4].

Missing data

There were no missing data for AFP or PIVKA-II.

Statistical analysis

The levels of AFP and PIVKA-II were log transformed because they had right-skewed distributions.

Based on the published coefficient values in the development cohort [8], we calculated the linear predictor of each patient in the validation cohort as an offset variable with the coefficient set to 1 and intercept set to 0 to validate the ASAP model. The linear predictor was calculated as follows: − 7.57711770 + 0.04666357[age] − 0.57611693[sex] + 0.42243533[log(AFP)] + 1.10518910[log(PIVKA-II)] [8]. We evaluated its discrimination, calibration, and clinical benefit and then recalibrated and updated it based on three logistic regression models [15]. Recalibration-in-the-large included the ASAP linear predictor with coefficient set to 1 and the assessed intercept. In recalibration, we evaluated the intercept and the coefficient of the ASAP predictor. In model revision, we kept all variables in the original ASAP model and refitted it. We used a bootstrap resampling procedure in the complete updated samples to correct for optimism when model revision was applied. The overall performance was evaluated by the Brier score and compared among models by the chi-square test. A lower Brier score indicates better performance or closer to being correct. The final updated model was selected according to a previously reported procedure as follows. If the test of the model revision against the original ASAP model was not significant, we adopted the original ASAP model; otherwise, we continued. If the test of the model revision against recalibration-in-the-large was not significant, we adopted the updated model intercept; otherwise, we continued. If the test of the model revision against recalibration was not significant, we adopted the recalibration model; otherwise, we adopted the revised model [15].

Discrimination was measured using the C-statistic. Calibration was assessed with calibration plots. The classification accuracy performances were measured using the sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) at a selection of thresholds. The net benefit (NB) of the model was evaluated with decision curve analysis.

We chose the median prediction value of the non-HCC cases in the final updated model and the HCC event rate in the validation cohort from a list of common thresholds to classify the patients into low-, medium-, and high-risk groups [10].

The differences in the relative risk ratios between two risk groups were compared by the chi-square test. The cumulative HCC incidences of the different risk groups without HCC were obtained by the Kaplan-Meier method, and their discrepancies were compared by the log-rank test.

The model established in our study complied with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines [12] (see Additional file 1: Table S1).

A p-value less than 0.05 was considered statistically significant. We carried out statistical analyses using packages in R software (v3.6.2 and v4.0.2; R Foundation for Statistical Computing).

Results

Participant selection and characteristics

In total, 3726 records of 2479 participants were reviewed; 1247 records were repeated measurements and, thus excluded. In total, 2479 patients with the first detection values of biomarkers were further selected. Furthermore, 1467 individuals were excluded for the following reasons: 18 patients treated with warfarin, vitamin K, vitamin K antagonist, or antibiotics that alter the gut flora; 320 non-HCC cancers; 232 HCC patients treated with antitumor agents; 83 health examiners and 729 patients without cancers or liver diseases; and 85 patients with hepatitis aged under 35 years. Ultimately, 1012 qualified patients were included in the model validation process. After excluding 157 confirmed HCC cases, 855 at-risk patients were included in the continued follow-up analysis to compare the outcomes across the different risk groups (Fig. 1). The validation cohort was dominated by single HBV-infected patients (77.3%), who had an HCC incidence rate of 15.6% (122/782), while the incidence in patients with other etiologies was 15.2% (35/230). There was no significant difference between the two incidences (χ2 p-value = 0.967). The overall HCC incidence in the validation cohort was 15.5%, which was significantly lower than that in the development cohort (41.1%) reported by Yang et al while deriving the original ASAP model [8]. Table 1 shows the basic characteristics of the two cohorts. Due to the different patient selection criteria and data collection methods from respective distinct clinical settings, there were significant differences in age, etiology, case composition, Child-Pugh class, bilirubin, alanine aminotransferase, and albumin between the development and validation cohorts. The AFP and PIVKA-II serum levels are shown in Additional file 2: Fig. S1a, b.

Fig. 1
figure 1

Analysis flowchart. Abbreviations: AFP, alpha-fetoprotein; HCC, hepatocellular carcinoma; PIVKA-II, prothrombin induced by vitamin K absence-II; SOL, space-occupying lesion

Table 1 Baseline characteristics of the study participants compared with the original development cohort

External model validation

Table 2 shows the characteristics and performance of the models considered in this paper. The models contained four predictor factors: age (years, in 1-year increments), sex (male = 0, female = 1), AFP (log ng/mL), and PIVKA-II (log mAU/mL). bage, bsex, bAFP, and bPIVKA-II were the regression coefficients that indicated how a patient’s values of the predictor factors affected the risk of HCC. Regarding bage, which was 0.04666357 in the original ASAP model [8], the odds ratio of age was exp(bage), i.e., 1.048. The increased risk of HCC among the patients was 0.048 per year after adjusting for other factors. Regarding the recalibration-in-the-large, including the ASAP linear predictor with the coefficient set to 1, bage was the same as that of the ASAP model. Regarding the recalibration assessed with a slope of 1.192, bage was 0.0556229754, equal to 0.04666357 multiplied by 1.192. In the model revision, we refitted it with our data, and the estimate of bage was 0.06178. Therefore, different models generated different bage values. The same was true for the diverse regression coefficients of other factors in different models. The negative bsex illustrated that the female patients had a lower risk of HCC than the reference group male patients after adjusting for age, AFP, and PIVKA-II. In the recalibration-in-the-large, the bsex was − 0.57611693. The odds ratio of sex was 0.562. Female patients had a 43.8% lower risk of HCC than male patients after adjusting for other variables in the model.

Table. 2 Characteristics and performance of the original and updated ASAP diagnostic prediction models

The ASAP model had a high diagnostic accuracy {C-statistic = 0.982, [95% confidence interval (CI), 0.972–0.992]} but significantly overestimated the risk of HCC (intercept − 3.243 and slope 1.192 in the calibration plot) (Table 2, Fig. 2a) and had poor clinical utility at the probability thresholds of 1/3 and 2/3 set in the model [8] [with respective NBs of 0.011 (95% CI − 0.017 to 0.038) and − 0.019 (95% CI, − 0.061 to 0.025) in the decision curve analysis] (Fig. 2c).

Fig. 2
figure 2

Calibration and decision curves of the ASAP model and the updated models predicting HCC risk. Calibration curves of the predicted probabilities versus the observed probabilities of (a) the original ASAP model reported by Yang et al [8]; and (b) recalibration-in-the-large developed based on the method described by Vergouwe et al [15] in the validation cohort (n = 1012). A nonparametric calibration curve with the 95% confidence limits (CL) (red slide line with dashed lines) was created with the Loess algorithm. Observed HCC occurrence (green triangles) with the 95% CL was plotted against the average predicted probability in each group. The blue straight diagonal line serves as a reference for perfect calibration. The brown bar chart at the bottom of the figure presents the distribution of the predicted probabilities of the cases with outcomes (above the line) and those without outcomes (below the line) (“1” vs. “0”). (c) Decision curves showing the net benefit correlated with the utility of the original ASAP model reported by Yang et al [8] and the updated models (recalibration-in-the-large, recalibration, and model revision) derived using the methods described by Vergouwe et al [15]. Notably, the coefficients of the models are listed in Table 2. The logit (P) calculation formula of the ASAP model was {− 7.57711770 + 0.04666357[age] - 0.57611693[sex]+ 0.42243533[log(AFP)] + 1.10518910[log(PIVKA-II)]}; the formula of recalibration-in-the-large was {-10.82011770 + 0.04666357[age] − 0.57611693[sex] + 0.42243533[log(AFP)] + 1.10518910[log(PIVKA-II)]}; the formula of recalibration was {− 12.60892 + 0.0556229754[age] − 0.6867313806[sex] + 0.5035429134[log(AFP)] + 1.3173854072[log(PIVKA-II)]}; and the formula of the model revision was {-12.95521 + 0.06178[age] − 0.7449[sex]+ 0.55643[log(AFP)] + 1.28268[log(PIVKA-II)]}. Abbreviations: AFP, alpha-fetoprotein; CL, confidence limits; HCC, hepatocellular carcinoma; Loess: locally weighted linear regression; PIVKA-II, prothrombin induced by vitamin K absence-II

Recalibration was necessary due to the apparent overfitting of the ASAP model with the new data (Table 2, Fig. 2a). The three updated models had the same excellent discrimination ability (C-statistic = 0.982), and their calibration was superior to that of the ASAP model, as they had nearly zero intercepts. The comparison of the model revision against the ASAP model was significant [likelihood ratio test (LRT) χ2 p-value < 0.001]. Recalibration-in-the-large revealed an intercept of − 3.243 and resulted in a significant improvement in the model fit compared with that of the ASAP model [LRT χ2 p-value = 0.04384 (data not shown)], with a recalibration slope of 1.192, and did not significantly differ from the revised model (LRT χ2 p-value = 0.332). A nonparametric calibration plot using the locally weighted linear regression (Loess) function is shown in Fig. 2b. Recalibration revealed an intercept of − 3.577 and a slope of 1.192. Recalibration further improved the Brier score and average absolute difference in the predicted and calibrated probabilities (Eavg) [16] and did not significantly differ from the revised model (LRT χ2 p-value = 0.913), with a recalibration slope close to 1. Calibration curves of the recalibration and model revision are shown in Additional file 3: Fig. S2a, b. The model revision improved the calibration the most, with the smallest Brier score and Eavg (Table 2). The bootstrap resampling procedure was used to validate the revised model as there was some overfitting with an optimism-corrected C-statistic of 0.981 and a shrinkage factor of 0.976. The nomogram and score tables of the re-estimated ASAP model are shown in Additional file 4: Fig. S3a and Additional file 5: Table S2, respectively.

Based on the closed testing procedure [15], the ASAP model significantly differed from the model revision (LRT χ2 p-value < 0.001), but recalibration-in-the-large showed no significant difference (LRT χ2 p-value = 0.332); recalibration-in-the-large was adopted for the final updated model by adding an intercept of − 3.243 to the ASAP linear predictor (Table 2).

In addition, the decisive curve of recalibration-in-the-large nearly completely overlapped with those of recalibration and model revision, with high NBs over the original model within the entire threshold range (Fig. 2c). At the 1/3 risk threshold, it increased the NB by 0.115 over the ASAP model, representing 11.5 more HCC cases identified per 100 patients for the same number of unnecessary interventions (Table 3) [17]. It is relatively more beneficial for clinicians to use the updated model to assess at-risk individuals and improve clinical decision-making.

Table. 3 Classification statistics of the selected thresholds for the recalibration-in-the-large of the validation cohort and subsets

Performance of recalibration-in-the-large among different etiologies

Table 3 lists the proportions of overall and HCC patients divided by the selected thresholds of recalibration-in-the-large and the statistical characteristics. The thresholds included the optimal cutoff value (13.1%), the incidence in the validation cohort (15.5%), the midpoint of the sigmoid curve (50%), the tierce-points of the sigmoid curve (1/3, 2/3) set in the original ASAP model [8], the median prediction probability in the non-HCC cases (1.3%), and the median prediction probability in the HCC cases (98.3%). All cutoff values were commonly adopted in the model performance evaluation. Lower thresholds resulted in higher sensitivities, lower specificities and higher NBs in the recalibration-in-the-large, and vice versa.

Recalibration-in-the-large functioned slightly better in patients with single HBV infections, with a C-statistic of 0.990 (95% CI 0.984–0.997), than in individuals with other etiologies, with a C-statistic of 0.943 (95% CI 0.894–0.992) (Z = 1.879, p-value = 0.060). From probability thresholds of 0 to 0.7, the single HBV-infected group presented consistently higher NBs than the remaining group (Table 3), equivalent to detecting more HCC cases without increasing the number of false positives [17].

Risk threshold selection and risk stratification

Risk stratification is the cornerstone of the individualized management of patients at risk for HCC. However, an optimal risk threshold does not exist for a prediction model. Reasonable risk thresholds should be selected based on the actual clinical setting. Considering the severity of the disease, low thresholds that are less than 50% and have high NPVs are usually preferred [10, 18]. Similar to the distribution of the predicted probabilities of HCC and non-HCC patients in the model revision (Additional file 4: Fig. S3b), the recalibration-in-the-large achieved a sensitivity of 100% and an NPV of 100% at a probability threshold of 1.3% and a specificity of 100% and a PPV of 100% at 98.3% with regard to the diagnosis of HCC regardless of etiology in the validation cohort (Table 3). From the listed thresholds, we selected probability thresholds of 1.3% and 15.5% to stratify the validation cohort into low-, medium-, and high-risk groups. As a result, 427 non-HCC patients were classified into the low-risk group; 183 patients, including 142 HCC cases, were classified into the high-risk group; and the other 402 patients, including 15 HCC cases, were sorted into the medium-risk group. The incidence rates in the three groups were 0% (0/427), 3.7% (15/402), and 77.6% (142/183). Taking the medium-risk group as the reference, the relative risk (RR) ratios and 95% CIs of incident HCC were 0.0 in the low-risk group and 20.8 (11.9–36.4) in the high-risk group.

Clinical outcome assessment after risk stratification

After excluding HCC cases with a confirmative diagnosis, the remaining 855 at-risk individuals were continuously observed for incident HCC during a median of 10.2 months of follow-up. The cumulative HCC incidences were significantly different in the low-, medium-, and high-risk groups (log-rank test, p-value < 0.001) (Fig. 3). The low-risk group maintained the lowest incidence over a period of up to two years. The 3-month, 6-month and 18-month cumulative incidences were 0.6%, 0.9% and 0.9% in the low-risk group; 2.0%, 3.6% and 5.8% in the medium-risk group; and 4.2%, 9.2% and 22.2% in the high-risk group.

Fig. 3
figure 3

Cumulative HCC incidences in three risk groups based on the recalibration-in-the-large. Kaplan-Meier curve demonstrating significant differences in the cumulative HCC incidences among the low-risk group (predicted probability < 1.3%, the median of the predicted probability in the non-HCC patients), medium-risk group (predicted probability 1.3–15.5%), and high-risk group (predicted probability > 15.5%, the HCC incidence in the validation cohort) (log-rank test, p-value = 9e−04; low-risk vs. medium-risk with p-value = 0.01, medium-risk vs. high-risk with p-value = 0.08, and low-risk vs. high-risk with p-value = 3e−05). The two probability thresholds were reasonable for risk classification based on the estimation by recalibration-in-the-large in the validation cohort. Abbreviations: HCC, hepatocellular carcinoma

These results showed that recalibration-in-the-large, combined with reasonable thresholds of 1.3% and 15.5%, could be used to effectively stratify patients based on their need for HCC surveillance into three risk groups that are useful in everyday clinical practice. Patients in the high-risk group are likely to be diagnosed with HCC soon and need fairly intensive surveillance. Patients with uncertain liver SOL in any risk group should undergo close monitoring at 3-month intervals. Patients with definitely benign or absent liver SOL in the low-risk group could undergo monitoring every year, and similar patients in the medium-risk group should return for surveillance at 6-month or shorter intervals.

Discussion

This study validated the ASAP model [8] in a significantly different setting from that in which it was developed. In this setting, the patients underwent all-cause surveillance for HCC (Table 1). The model was highly miscalibrated, and its uncertainties were corrected using three methods. By comparing the overall performance of these models, the recalibration-in-the-large model was used for the prediction of incident HCC, and two reasonable thresholds were set for the classification management of at-risk patients.

Although the original model was significantly more discriminative in the validation cohort than in the development cohort [C-statistic, 0.982 (95% CI 0.972–0.992) vs. 0.941 (95% CI 0.929–0.952)] [8], it systematically overestimated the probability of HCC (Fig. 2a), resulting in a reduced clinical benefit and possibly even harm (Fig. 2c), including more radiation exposure, more psychological harm, and higher costs [10, 11, 19,20,21]. The main reason may be the difference in case mix in the two cohorts [11]; the average increased prediction probability of 25.6% was the difference between the two HCC incidences (41.1% and 15.5%).

The LRT between the original model and model revision (p-value < 0.001) highlighted the need for recalibration [15]. Recalibration-in-the-large yielded comparable performance to model revision (LRT χ2 p-value = 0.332), e.g., the same excellent discrimination, good calibration, and increased NB across a wide range of probability thresholds (Fig. 2b, c), avoiding the excessive and unnecessary measurements that would be undertaken if the original model was used. Additionally, the updated model had superior accuracy and clinical usefulness in patients at risk for HCC with single HBV infections than in those with other etiologies.

The hierarchy of the quality of the model external verification using new data from an independent research institution is better than temporal verification and geographic verification using data collected by the original research team for model development. Model verification with new data might lead to decreased accuracy and miscalibration and might influence the model outcome. However, we could adjust the overfitting, improve the model's calibration and overall prediction performance, and use the model in diverse populations and settings [22, 23]. Our data were collected from a different population and clinical scenario from that of the development cohort. We corrected the calibration by adding an intercept of − 3.243 to the original ASAP model, thus promoting its generalizability. Our research provided a robust verification of the ASAP model.

Accurate predictions and precise interventions have received increasing attention. Personalization surveillance schemes are expected to benefit patients by facilitating the early detection of HCC, while risk-adapted interventions are expected to prevent or delay the progression of HCC and improve patient prognosis [4, 24, 25]. The absence of valid noninvasive risk stratification tools makes the personalized management of at-risk patients challenging. The risk thresholds in a prediction model are commonly determined arbitrarily [10] and seldom further tested with regard to their impact on clinical outcomes [7, 26]. Proper thresholds can benefit clinical management strategies, while inappropriate thresholds negatively affect decision-making processes [10]. The thresholds set for the online ASAP nomogram, 1/3 and 2/3, were somewhat unsuitable for the recalibration-in-the-large to risk stratification in clinical contexts similar to ours, which led to poor predictive results at the follow-up outcome evaluation, i.e., most patients were in the low-risk group, there were fewer medium-risk patients than high-risk patients (32 vs. 124), and there was a higher cumulative HCC incidence rate in the low-risk group than in the medium-risk group (see Additional file 6: Fig. S4). The choice of risk thresholds, discrimination, calibration, and clinical utility need to be reevaluated before its application in a new clinical setting. We assessed the performance of recalibration-in-the-large at multiple probability thresholds (Table 3) [10, 26] and chose a relatively stable median value of non-HCC cases and a less harmful event rate to group patients according to their level of risk [11, 18] with reference to the distribution characteristics of the predictive values in the study cohort (see Additional file 4: Fig. S3b). The probability thresholds of 1.3% and 15.5% reasonably and effectively defined the risk categories (Table 3 and Fig. 3). Thus, the prediction model laid a foundation for the further clinical application of risk stratification to better manage targeted individuals [26, 27].

Routine semiannual follow-up monitoring in at-risk patients was expected to improve the detection of early-stage HCC and the clinical outcomes [3, 20, 28, 29]. Low-risk populations with an annual incidence rate of HCC < 0.2% can be exempted from surveillance [28, 30, 31]. However, current studies have shown that the clinical effects are unclear for various reasons [3, 20]. First, the limited compliance of patients was not satisfactory [4]. Second, the one-size-fits-all recommendation was unlikely to be applicable to all patients in multiple settings [32]. Recently, Rich NE et al reported that tumor growth has obvious heterogeneity, and the tumor doubling time varies from less than 3 months to more than half a year [33]. 3-month diagnostic delays can allow tumors to grow significantly, leading to less effective treatment [34]. The expert consensus in East Asia also pointed out that HBV-infected patients with two risk factors for HCC development need to initiate antiviral therapy or undergo monitoring at 3-month intervals [25]. Third, the 6-month or unclear detection time window in most HCC risk prediction models leads to decreased predictive performance of models based on tumor markers that fluctuate over time [14] and ineffective suggestions regarding the need for 3-month follow-up visits. Our follow-up outcome analysis showed that among the low-risk group, which had an HCC diagnosis rate of 0%, at least 0.6% of the individuals should return at 3 months to facilitate the discrimination of very early-stage HCC from atypical hyperplastic nodules. Three-month follow-up strategies developed for the small proportion of the population with nodules of uncertain identity will facilitate the early diagnosis of HCC. The evaluation of the clinical impact of predictive models helps refine the follow-up strategies for different risk groups.

Routine serum biomarkers are often helpful in assessing the HCC risk of patients. Despite its limited accuracy, AFP is the most commonly used HCC diagnostic biomarker due to its increase in the early stages of the disease. PIVKA-II is another widely used HCC marker with high specificity. PIVKA-II is a powerful supplement to AFP in HCC diagnosis [35], and their combination could improve the accuracy [35,36,37,38]. Recent reports have suggested that the ratio of gamma-glutamyltransferase to aspartate aminotransferase (γ-GT/AST) and the Lens culinaris agglutinin-reactive fraction of AFP (AFP-L3) are valuable markers in HCC diagnosis. The addition of the γ-GT/AST ratio or AFP-L3 to the combination of AFP and PIVKA-II could further improve the accuracy of HCC diagnosis [36, 37]. Our data show that the γ-GT/AST ratio of the patients with HCC was significantly higher than that of the other risk groups (p-value < 0.05) (Table 1). However, when combined with AFP and PIVKA-II, the γ-GT/AST ratio did not significantly contribute to the logistic regression models predicting HCC (p-value = 0.835) or early HCC (p-value = 0.716). AFP-L3 has been reported to be a sensitive indicator for early HCC detection, and its diagnostic performance was inferior to AFP and PIVKA-II [37, 38]. Considering its cost-effectiveness and relatively lower performance in HCC prediction, few studies have reported the effectiveness of AFP-L3 combined with AFP and PIVKA-II in the diagnosis of overall and early HCC [8, 37]. In the existing reports, the increment value of the C-statistic of AFP-L3 to the model of AFP and PIVKA-II remains controversial [37, 38].

Our analysis showed that HCC occurrence was most closely related to hepatitis B virus infection; the incidences of HCC in populations with different etiologies were similar (p-value > 0.05). The simpler ASAP model that only included the AFP and PIVKA-II levels and two patient observable attributes, such as age and sex [8], did not need to detect other indexes, was not limited by the etiologies of liver disease and the antiviral status, could accurately predict the risk of HCC and was easily validated and calibrated with new data from diverse settings.

Our research is the first to apply the HCC diagnostic model as a prognostic management tool to the population at risk of HCC. This model efficiently excavated the potential predictive function of the ASAP diagnostic model. This model helped make clinical follow-up decisions for different risk populations, which could achieve long-term and refined management of patients at risk for HCC. Based on the risk thresholds selected based on the population distribution characteristics, high-risk groups could receive more attention and monitoring, and the allocation of medical resources could be optimized, which could be beneficial for the early detection, intervention, diagnosis, and treatment of HCC. This model could improve the prognosis of at-risk patients for HCC. Moreover, these indicators have the characteristics of noninvasiveness, easy availability, repeatability, and standardization, which are conducive to applying and promoting the model in clinical practice.

Strength

We determined three weeks as the test-to-diagnosis interval based on the clinical characteristics of HCC diagnosis [12, 13], which increased the performance of the model and provided evidence supporting the development of 3-month follow-up strategies and patient counseling. We improved the applicability of the ASAP model to patients at risk for HCC regardless of etiology by updating it based on data from a real-world cohort. We clarified that stratified management with reasonable thresholds had the potential to aid in achieving an early diagnosis and avoiding clinical harm. We refined the personalized recommendations for different risk groups through a follow-up analysis, e.g., definitively low-risk patients can prolong their follow-up intervals to one year, avoiding unnecessary investigations, enabling the allocation of limited medical resources to the most high-risk patients, which is an especially important consideration during the coronavirus disease 2019 pandemic.

Limitations

A total of 18.8% (161/855) of the patients were lost to follow-up, which might have resulted in inaccurate estimates concerning the associations between HCC occurrence and the predicted risk. The current study focused on the prediction of the risk of all-cause HCC; however, the risk of HCC differs by etiology [3, 4, 28, 29]. It would be better to generate recalibrated models for each specific etiology, which would necessitate a larger sample size. The models appear to be unable to identify the risk of very early-stage cases due to their relatively low proportion in the study and the large overlap in the AFP and PIVKA-II levels between early-stage HCC and the risk groups (Additional file 2: Fig. S1a, b). Nevertheless, the addition of an appropriate risk classification strategy to the high discrimination ability of the predictive model could effectively improve the status quo. The classification of patients with indeterminate nodules into the high-risk group, thereby ensuring that these patients undergo intensive surveillance, may address this shortcoming [4]. Validation was performed at a single center with limited data, and no cost-benefit analysis was performed.

Conclusion

The ASAP model can be used to accurately predict incident HCC and facilitate the personalized management of at-risk patients. However, independent and robust model validation is essential to ensure that the appropriate tools are available for the precise prediction of incident HCC, which can be used during patient consultation and intervention decision-making in novel clinical settings. The well-calibrated model obtained with recalibration-in-the-large combined with the risk-stratified strategies established in the article is suitable for the noninvasive monitoring of patients at risk for HCC in primary care centers. The overall performance needs further validation and refinement in larger cohorts. Developing region-specific recalibration models and selecting the corresponding risk thresholds may improve the generalizability of the model.