Introduction

Hepatitis B virus (HBV) infection is a major public health threat because of its high prevalence, affecting 257 million people worldwide in 2016 [1]. Chronic hepatitis B (CHB) is highly endemic in China, with over 74 million hepatitis B surface antigen (HBsAg)-positive patients [2, 3]. The major complications of CHB include cirrhosis and hepatocellular carcinoma (HCC), which lead to a poor prognosis [2]. The number of CHB patients undergoing antiviral treatment remains unknown [4]. To control the spread of CHB in China, early diagnosis of and intervention in HBV infection are essential.

Fibrosis staging, an approach to assessing HBV-induced liver disease, is an efficient way to estimate patient prognosis and to identify those requiring antiviral treatment [5]. Liver biopsy is traditionally recommended as the standard for staging fibrosis [6], but it is limited by its invasiveness, cost [7, 8] and unavoidable sampling errors [9, 10]. Therefore, a variety of noninvasive tests have been developed in recent years.

As summarized in the EASL-ALEH clinical practice guidelines [11], noninvasive staging usually relies on mathematical scores derived from serum biomarkers and on elasticity-based imaging techniques, such as transient elastography (TE) and magnetic resonance elastography (MRE). Although the guidelines introduce several strategies that combine TE with computer algorithms, these apply only to patients infected with hepatitis C virus (HCV). Moreover, none of these strategies incorporates measurements or macroscopic characteristics from imaging examinations.

Because machine learning can tease out complex, non-linear relationships in data [12, 13], we conducted a retrospective multicenter study, established a novel multivariable algorithmic model, named FibroBox, in a cohort of CHB patients from Huai’an and Jilin, and then evaluated its predictive accuracy in external validation sets from Anhui and Beijing.

Methods

Patients

We selected 1843 treatment-naïve CHB patients who underwent liver biopsy, blood tests, B-mode ultrasound examination and Fibroscan (FS402, Echosens, France) at four centers: Huai’an Fourth People’s Hospital (Huai’an, China) (June 2010–October 2017), Beijing You-An Hospital (Beijing, China) (December 2013–April 2017), Hepatology Hospital of Jilin Province (Jilin, China) (July 2008–October 2016) and The First Affiliated Hospital of Anhui University of Chinese Medicine (Anhui, China) (February 2012–November 2017). Their clinical data were retrospectively collected from the hospital information systems. Patients were included if they underwent liver biopsy and met at least one of the following criteria: aspartate transaminase (AST) or alanine transaminase (ALT) ≥ 40 IU/L, liver stiffness ≥ 6.5 kPa, HBV DNA ≥ 2000 IU/mL, or a family history of liver disease. The exclusion criteria were co-infection with HCV, hepatitis D virus (HDV) or human immunodeficiency virus (HIV); focal hepatic lesions (e.g. HCC, hepatic tuberculosis or any other); significant alcohol intake (> 20 g/day); severe hepatic failure (complications such as jaundice or ascites, or transaminase levels over 10 times the upper limit of normal [ULN]); acute heart failure; pregnancy; and BMI > 30 kg/m².
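The inclusion and exclusion rules above can be written as a single eligibility check. The sketch below is illustrative only: the `Patient` record and its field names are hypothetical, the transaminase ULN (`uln_iu_l`) is an assumed input because its value is not stated in the text, and qualitative exclusions such as co-infection or focal lesions are reduced to boolean flags.

```python
from dataclasses import dataclass, replace


@dataclass
class Patient:
    # Hypothetical record; field names are illustrative, not the study's schema.
    had_biopsy: bool
    ast_iu_l: float
    alt_iu_l: float
    lsm_kpa: float
    hbv_dna_iu_ml: float
    family_history: bool
    uln_iu_l: float  # assumed transaminase ULN (not given in the text)
    bmi: float
    coinfection: bool  # HCV, HDV or HIV
    focal_lesion: bool  # e.g. HCC or hepatic tuberculosis
    alcohol_g_per_day: float
    jaundice_or_ascites: bool
    acute_heart_failure: bool
    pregnant: bool


def meets_inclusion(p: Patient) -> bool:
    """Biopsy plus at least one clinical trigger."""
    triggers = (
        p.ast_iu_l >= 40 or p.alt_iu_l >= 40,
        p.lsm_kpa >= 6.5,
        p.hbv_dna_iu_ml >= 2000,
        p.family_history,
    )
    return p.had_biopsy and any(triggers)


def meets_exclusion(p: Patient) -> bool:
    """Any single exclusion criterion removes the patient."""
    severe_failure = (
        p.jaundice_or_ascites or max(p.ast_iu_l, p.alt_iu_l) > 10 * p.uln_iu_l
    )
    return (
        p.coinfection
        or p.focal_lesion
        or p.alcohol_g_per_day > 20
        or severe_failure
        or p.acute_heart_failure
        or p.pregnant
        or p.bmi > 30
    )


def is_eligible(p: Patient) -> bool:
    return meets_inclusion(p) and not meets_exclusion(p)


example = Patient(
    had_biopsy=True, ast_iu_l=35, alt_iu_l=55, lsm_kpa=5.8, hbv_dna_iu_ml=1.2e4,
    family_history=False, uln_iu_l=40, bmi=23.5, coinfection=False,
    focal_lesion=False, alcohol_g_per_day=0, jaundice_or_ascites=False,
    acute_heart_failure=False, pregnant=False,
)
print(is_eligible(example))  # True (ALT and HBV DNA triggers, no exclusions)
print(is_eligible(replace(example, bmi=31.0)))  # False (BMI > 30)
```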

Liver biopsy

Percutaneous liver biopsy (LB) was performed under ultrasound guidance by experienced ultrasonologists. Liver samples were formalin-fixed and paraffin-embedded for histological analysis, which was performed by three senior pathologists at each center. When the pathologists’ readings of a sample differed, the consensus reading was taken as the final diagnosis. Samples with fewer than three portal tracts were considered of poor quality and excluded from the analysis. All pathologists were blinded to the clinical information. Liver fibrosis was staged with the Metavir system [14]; F ≥ 2 was defined as significant fibrosis and F4 as cirrhosis.
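The Metavir-based definitions translate directly into the two binary targets used throughout the study. This is a minimal sketch; the function names are hypothetical, and the three-portal-tract adequacy rule is included as a simple filter.

```python
from __future__ import annotations

METAVIR_STAGES = ("F0", "F1", "F2", "F3", "F4")


def is_adequate(portal_tracts: int) -> bool:
    """Samples with fewer than three portal tracts are excluded."""
    return portal_tracts >= 3


def fibrosis_targets(stage: str) -> dict[str, int]:
    """Map a Metavir stage to the two binary outcomes."""
    if stage not in METAVIR_STAGES:
        raise ValueError(f"unknown Metavir stage: {stage!r}")
    f = int(stage[1])
    return {
        "significant_fibrosis": int(f >= 2),  # F2, F3 or F4
        "cirrhosis": int(f == 4),  # F4 only
    }


print(fibrosis_targets("F3"))  # {'significant_fibrosis': 1, 'cirrhosis': 0}
```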

Transient elastography (Fibroscan)

All liver stiffness measurements (LSMs) were performed with Fibroscan devices (FS402, Echosens, France) by skilled technicians according to the manufacturer’s protocol [15]. TE results were expressed in kilopascals (kPa). For each patient, the median of 10 valid measurements was taken as the final TE value. A TE result was considered unreliable if the median LSM was > 7.1 kPa and the ratio of the interquartile range (IQR) to the median (IQR/LSM) was > 0.30 [16].
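The reliability rule combines two conditions, and a result is flagged only when both hold. The sketch below summarizes one patient's measurements under that rule. It assumes linearly interpolated quartiles (`statistics.quantiles` with `method="inclusive"`); the Fibroscan device computes its IQR internally and its exact interpolation may differ. The function name is hypothetical.

```python
from __future__ import annotations

import statistics


def summarize_lsm(values_kpa: list[float]) -> tuple[float, float, bool]:
    """Return (median LSM, IQR/median, reliable) for one patient."""
    if len(values_kpa) < 10:
        raise ValueError("the protocol requires 10 valid measurements")
    median = statistics.median(values_kpa)
    # Linearly interpolated quartiles (assumed; same as NumPy's default).
    q1, _, q3 = statistics.quantiles(values_kpa, n=4, method="inclusive")
    iqr_ratio = (q3 - q1) / median
    # Unreliable only when BOTH conditions hold, as stated in the text.
    reliable = not (median > 7.1 and iqr_ratio > 0.30)
    return median, iqr_ratio, reliable


# Median 10.5 kPa with IQR/M of about 0.43: flagged as unreliable.
print(summarize_lsm([6, 7, 8, 9, 10, 11, 12, 13, 14, 15]))
```

The same dispersion at a median below 7.1 kPa would still be accepted, because the rule only penalizes high dispersion at higher stiffness values.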

Traditional serum index calculation

The aspartate transaminase (AST)-to-platelet ratio index (APRI) [17] and the fibrosis-4 index (FIB-4) [18] are two common composite surrogate markers that combine easily acquired parameters using simple formulas. APRI and FIB-4 are calculated as follows:

$$ \mathrm{APRI}=\frac{\left(\mathrm{AST}\ (\mathrm{IU/L})/\mathrm{ULN}\right)\times 100}{\mathrm{Platelet\ count}\ (10^9/\mathrm{L})} $$
$$ \mathrm{FIB}\text{-}4=\frac{\mathrm{Age}\ (\mathrm{years})\times \mathrm{AST}\ (\mathrm{IU/L})}{\mathrm{Platelet\ count}\ (10^9/\mathrm{L})\times \sqrt{\mathrm{ALT}\ (\mathrm{IU/L})}} $$
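Both indices are one-line computations. The sketch below mirrors the two formulas; the function names are hypothetical, and the AST ULN of 40 IU/L in the usage example is an assumption for illustration, not a value reported by the study.

```python
import math


def apri(ast_iu_l: float, ast_uln_iu_l: float, platelets_1e9_l: float) -> float:
    """APRI = (AST / ULN) x 100 / platelet count (10^9/L)."""
    return (ast_iu_l / ast_uln_iu_l) * 100 / platelets_1e9_l


def fib4(age_years: float, ast_iu_l: float, alt_iu_l: float,
         platelets_1e9_l: float) -> float:
    """FIB-4 = age x AST / (platelet count x sqrt(ALT))."""
    return (age_years * ast_iu_l) / (platelets_1e9_l * math.sqrt(alt_iu_l))


# Illustrative inputs only; ULN = 40 IU/L is assumed.
print(apri(ast_iu_l=80, ast_uln_iu_l=40, platelets_1e9_l=100))  # 2.0
print(fib4(age_years=50, ast_iu_l=80, alt_iu_l=64, platelets_1e9_l=100))  # 5.0
```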

All of these input parameters were measured at hospital admission, before any intervention.

Ultrasonic measurement

In this study, the parameters measured during ultrasonic examination included spleen size (mm², length × thickness), splenic vein diameter (mm) and portal vein diameter (mm). Each parameter was measured at least three times by experienced ultrasonologists, and the mean value was taken as the final measurement.

Training sets

Two training sets from Huai’an and Jilin (n = 549), comprising treatment-naïve HBV-infected patients who fully met the study criteria, were used to build the algorithmic model (FibroBox). The two sets were not fully comparable, but the model could normalize them.

Validation sets

The diagnostic performance of FibroBox and other noninvasive markers was evaluated in external validation sets from the Anhui and Beijing cohorts. In the Anhui (n = 408) and Beijing (n = 332) cohorts, CHB patients who underwent biopsy and had available data on TE, AST, ALT and platelet count were included in the analysis.

FibroBox construction

The data characteristics, preprocessing and training/testing procedures of FibroBox are described in Supplementary Material 1. All variables were normalized to minimize systematic errors between centers. Algorithmic models (Supplementary Material 1) were then used to select significant variables and to perform training and validation. The machine learning algorithm was implemented in Python 3.7 (Amsterdam, Netherlands).
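The exact normalization is given in Supplementary Material 1 and is not reproduced here. As one plausible reading, the sketch below assumes per-center z-score standardization, so that each variable has mean 0 and standard deviation 1 within every center. The function name, the use of the population standard deviation and the toy values are all assumptions.

```python
import statistics
from collections import defaultdict


def zscore_by_center(rows: list, feature: str) -> list:
    """Standardize `feature` within each center (assumed scheme)."""
    values_by_center = defaultdict(list)
    for row in rows:
        values_by_center[row["center"]].append(row[feature])
    # Per-center mean and population standard deviation.
    params = {
        center: (statistics.mean(vals), statistics.pstdev(vals))
        for center, vals in values_by_center.items()
    }
    out = []
    for row in rows:
        mean, sd = params[row["center"]]
        out.append(0.0 if sd == 0 else (row[feature] - mean) / sd)
    return out


# Toy values for two hypothetical centers.
rows = [
    {"center": "A", "lsm_kpa": 6.0},
    {"center": "A", "lsm_kpa": 9.0},
    {"center": "B", "lsm_kpa": 8.0},
    {"center": "B", "lsm_kpa": 14.0},
]
print(zscore_by_center(rows, "lsm_kpa"))  # [-1.0, 1.0, -1.0, 1.0]
```

A stricter pipeline would estimate normalization parameters on training data only, so that validation cohorts do not influence them; which variant FibroBox uses is not stated here.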

Statistical analysis

The diagnostic accuracy of FibroBox and conventional fibrosis markers (APRI, FIB-4 and Fibroscan) was estimated using the area under the receiver operating characteristic curve (AUROC) and the proportion of correctly classified patients with fibrosis/cirrhosis. DeLong’s test [19] with a significance level of 0.05 was used to compare the AUROC of FibroBox with those of the other markers. Agreement between them was described using Cohen’s kappa coefficient. Decision curve analysis (DCA) and ROC analysis were performed with R 3.5.1. Statistical analysis was conducted using SPSS 19.0 (SPSS Inc., Chicago, IL, USA).
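Two of the statistics named above have short definitions. The AUROC equals the probability that a randomly chosen patient with the outcome scores higher than one without it (ties count one half), and Cohen’s kappa corrects the observed agreement for agreement expected by chance. The plain-Python sketch below illustrates both; it is not the R or SPSS code used in the study, and the toy inputs are invented.

```python
from collections import Counter


def auroc(scores, labels):
    """Mann-Whitney form of the AUROC: share of correctly ranked pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))


def cohen_kappa(a, b):
    """Agreement between two classifications, corrected for chance."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from the two marginal distributions.
    p_exp = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / n ** 2
    if p_exp == 1:
        return 1.0
    return (p_obs - p_exp) / (1 - p_exp)


print(auroc([0.9, 0.8, 0.4, 0.7, 0.3], [1, 1, 1, 0, 0]))  # 0.8333333333333334
print(cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.5
```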

Results

Study population

Between July 2008 and November 2017, 1843 HBV-infected patients were retrospectively enrolled in this study (Fig. 1). After exclusion of patients with HCC or other tumors (n = 193) or liver abscess (n = 86) and of 171 (9.3%) patients who refused to participate, histological specimens of 1393 (75.6%) patients were eligible. Review of the clinical information then revealed co-infection with HDV in 14 patients and with HIV in 26 (Fig. 1), and the data of 64 patients were incomplete. Therefore, 1289 patients were finally included in the study. The TE results of all included patients were reliable according to the criteria proposed by Boursier et al. [16]. The main characteristics of the study patients are summarized in Table 1.
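The participant flow can be checked arithmetically. All counts below are taken from the text; treating the 171 refusals as removed before the 1393 eligible specimens is an inference from the counts, since 1843 − 193 − 86 alone gives 1564.

```python
enrolled = 1843
tumors, abscess, refused = 193, 86, 171
eligible_specimens = enrolled - tumors - abscess - refused

hdv, hiv, incomplete = 14, 26, 64
included = eligible_specimens - hdv - hiv - incomplete

# Training set (Huai'an + Jilin) and the two validation cohorts.
cohort_sizes = {"training": 549, "anhui": 408, "beijing": 332}

print(eligible_specimens, f"{eligible_specimens / enrolled:.1%}")  # 1393 75.6%
print(f"{refused / enrolled:.1%}")  # 9.3%
print(included, sum(cohort_sizes.values()))  # 1289 1289
```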

Fig. 1

Flow diagram of the study population and reasons for exclusion. CHB, chronic hepatitis B; HCC, hepatocellular carcinoma; HDV, hepatitis D virus

Table 1 Baseline characteristics of the study population in training set (Huai’an, Jilin and Anhui) and in validation sets (Beijing)

Histopathology

No complications were reported after liver biopsy. Significant fibrosis and cirrhosis accounted for 63.2% (815) and 22.5% (290) of all included patients, respectively. Nearly 30% of patients (382; 29.6%) had moderate-to-severe histological activity (A2/A3), and no steatosis was reported by the histopathologists. The readings of two pathologists agreed for 994 (77.1%) specimens; for the remaining specimens, with discordant readings, the final diagnosis was reached by a third experienced histopathologist.

Training sets in Huai’an and Jilin

In Spearman correlation analyses of the original variables, liver fibrosis stage was associated with age, AST, GGT, total bilirubin, platelet count, WBC, PT, ALP, albumin, INR, PIIINP, type IV collagen, laminin, HA, spleen size, splenic vein diameter, portal vein diameter, portal vein velocity and Fibroscan results (Table 2). Subsequent multivariable analysis using least absolute shrinkage and selection operator (LASSO) logistic regression (Fig. 2) and the filter method [20] (Supplementary Material 1) selected Fibroscan results, platelet count, AST, PT, PIIINP, type IV collagen, laminin, HA and portal vein diameter as input parameters of the diagnostic models for significant fibrosis and cirrhosis.
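As an illustration of the LASSO step, the sketch below fits an L1-penalized logistic regression, selects the penalty by 10-fold cross-validated AUC and keeps the features with nonzero coefficients. It uses scikit-learn’s `LogisticRegressionCV` as a swapped-in implementation: the library used for FibroBox is not specified here, scikit-learn parameterizes the penalty through `C` (the inverse of the regularization strength) rather than λ, and it picks the best mean cross-validated AUC (the minimum criterion) rather than the 1-SE rule. The function name and synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def lasso_select(X, y, feature_names, folds=10, seed=0):
    """Return feature names with nonzero L1-penalized logistic coefficients."""
    model = make_pipeline(
        StandardScaler(),  # common scale so the penalty treats features equally
        LogisticRegressionCV(
            Cs=20,  # grid of inverse regularization strengths
            cv=folds,  # stratified k-fold for classifiers
            penalty="l1",
            solver="liblinear",  # a solver that supports the L1 penalty
            scoring="roc_auc",  # choose C with the best mean CV AUC
            random_state=seed,
        ),
    )
    model.fit(X, y)
    coefs = model[-1].coef_.ravel()
    return [name for name, c in zip(feature_names, coefs) if c != 0.0]


# Synthetic demo: only x0 and x1 carry signal.
rng = np.random.RandomState(0)
X = rng.normal(size=(400, 6))
logit = 2.5 * X[:, 0] - 2.0 * X[:, 1]
y = (rng.uniform(size=400) < 1 / (1 + np.exp(-logit))).astype(int)
names = [f"x{i}" for i in range(6)]
print(lasso_select(X, y, names))
```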

Table 2 Selection of original variables associated with fibrosis stage in the training set
Fig. 2

Feature selection using a parametric method, least absolute shrinkage and selection operator (LASSO) regression. a Selection of the tuning parameter (λ) in the LASSO model for significant fibrosis, using 10-fold cross-validation and the minimum criterion. The AUC was plotted versus log(λ). Dotted vertical lines were drawn at the optimal values given by the minimum criterion and by the one-standard-error (1-SE) criterion. The optimal log(λ) of −3.96 was chosen. b Feature selection for cirrhosis; the optimal log(λ) of −4.83 was chosen. c LASSO coefficient profiles of the 18 initially selected features. A vertical line was plotted at the optimal λ value, which resulted in 9 features with nonzero coefficients. d LASSO coefficient profiles of the 16 initially selected features. A vertical line was plotted at the optimal λ value, which resulted in 9 features with nonzero coefficients

In the training cohort, the AUROC of the FibroBox for predicting significant fibrosis (0.914, 95% CI 0.890 to 0.938) was higher than that of the models using TE alone (0.886, 95% CI 0.856 to 0.917), APRI (0.692, 95% CI 0.643 to 0.741) or FIB-4 (0.707, 95% CI 0.659 to 0.755). The optimal cut-off value of FibroBox was 0.38.

For predicting cirrhosis, the AUROC of FibroBox (0.914, 95% CI 0.885 to 0.943) was better than that of TE (0.880, 95% CI 0.844 to 0.917), APRI (0.705, 95% CI 0.659 to 0.752) and FIB-4 (0.758, 95% CI 0.713 to 0.804). The optimal cut-off value of FibroBox was 0.56.

Validation set in Anhui

In the Anhui cohort (n = 408), fibrosis stage based on histopathology was shown as follows: 10 (2.5%) in F0, 144 (35.3%) in F1, 129 (31.6%) in F2, 66 (16.2%) in F3 and 59 (14.5%) in F4 (Table 1).

FibroBox showed higher AUROCs than TE, APRI and FIB-4 (Fig. 3a, Table 3): 0.88 (95% CI 0.84 to 0.92) for predicting significant fibrosis and 0.87 (95% CI 0.82 to 0.92) for predicting cirrhosis. With the optimal cut-off values determined in the training set (0.38 for significant fibrosis and 0.56 for cirrhosis), the proportion of correctly classified patients was 0.81 for both significant fibrosis (sensitivity 0.80, specificity 0.82) and cirrhosis (sensitivity 0.51, specificity 0.94).

Fig. 3

The performance of the prediction models (FibroBox, TE, APRI and FIB-4) for significant fibrosis and cirrhosis in the Anhui cohort (a) and Beijing cohort (b), assessed by the area under the receiver operating characteristic (ROC) curve

Table 3 Diagnostic performance of FibroBox, TE, APRI and FIB-4 in the validation cohorts (Anhui and Beijing)

Across the range of reasonable threshold probabilities in this cohort, DCA showed that FibroBox provided a larger net benefit than TE, APRI and FIB-4 in diagnosing significant fibrosis and cirrhosis (Fig. 4a). This provides supplementary evidence for the comparison of FibroBox and TE in predicting cirrhosis, for which the AUROC difference did not reach statistical significance (p = 0.058).
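Decision curve analysis compares models by their net benefit at a threshold probability p_t: NB = TP/n − (FP/n) × p_t/(1 − p_t), where a patient is treated as positive when the predicted probability is at least p_t. The sketch below computes the net benefit of a model and of the “treat all” reference strategy (“treat none” has a net benefit of 0). It is a plain-Python illustration rather than the R code used in the study, and the toy inputs are invented.

```python
def net_benefit(probs, labels, threshold):
    """Net benefit of treating patients whose predicted probability >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    odds = threshold / (1 - threshold)  # weight of a false positive vs a true positive
    return tp / n - fp / n * odds


def treat_all_benefit(labels, threshold):
    """Reference strategy: treat every patient."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


probs = [0.9, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 0]
for pt in (0.3, 0.5):
    print(pt, net_benefit(probs, labels, pt), treat_all_benefit(labels, pt))
```

Plotting these values over a range of thresholds for each marker gives the decision curves; a model is preferred where its curve lies above both reference strategies.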

Fig. 4

Decision curve analysis (DCA) of the prediction models (FibroBox, TE, APRI and FIB-4) for significant fibrosis and cirrhosis in the Anhui cohort (a) and Beijing cohort (b)

Validation set in Beijing

In the Beijing cohort (n = 332), 26 (7.9%) were F0, 127 (38.3%) were F1, 73 (22%) were F2, 32 (9.6%) were F3 and 74 (22.3%) were F4 according to the liver histology results (Table 1).
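The class imbalance discussed later can be read directly off these stage distributions. The sketch below recomputes, from the counts reported for the two validation cohorts, the percentages with significant fibrosis (F2–F4) and cirrhosis (F4); the function name is hypothetical.

```python
STAGE_COUNTS = {
    "Anhui": {"F0": 10, "F1": 144, "F2": 129, "F3": 66, "F4": 59},
    "Beijing": {"F0": 26, "F1": 127, "F2": 73, "F3": 32, "F4": 74},
}


def prevalences(counts: dict) -> tuple:
    """Return (n, % significant fibrosis, % cirrhosis), rounded to 0.1%."""
    n = sum(counts.values())
    significant = counts["F2"] + counts["F3"] + counts["F4"]
    return n, round(100 * significant / n, 1), round(100 * counts["F4"] / n, 1)


for cohort, counts in STAGE_COUNTS.items():
    print(cohort, prevalences(counts))
# Anhui (408, 62.3, 14.5)
# Beijing (332, 53.9, 22.3)
```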

For predicting significant fibrosis (Fig. 3b), the AUROC of FibroBox (0.87, 95% CI 0.83 to 0.91) was significantly higher than that of TE (0.82, 95% CI 0.77 to 0.87, p <  0.001), APRI (0.70, 95% CI 0.65 to 0.76, p <  0.001) and FIB-4 (0.67, 95% CI 0.61 to 0.73, p <  0.001) (Table 3). For predicting cirrhosis (Fig. 3b), FibroBox (0.90, 95% CI 0.85 to 0.94) performed significantly better than APRI (0.75, 95% CI 0.67 to 0.82, p <  0.001) and FIB-4 (0.70, 95% CI 0.62 to 0.79, p <  0.001) (Table 3), whereas there was no significant difference between FibroBox and TE (0.89, 95% CI 0.85 to 0.94, p = 0.863). DCA showed consistent results (Fig. 4b).
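The p-values above come from DeLong’s test for paired AUROCs measured on the same patients. As a sketch of how such a comparison works, the code below implements the DeLong variance of the difference between two correlated AUROCs with a two-sided normal approximation. It is a plain-Python illustration, not the R code used in the study; the function names and toy scores are hypothetical.

```python
import math


def _psi(x: float, y: float) -> float:
    """Kernel: 1 if the positive case scores higher, 0.5 for a tie, else 0."""
    return 1.0 if x > y else 0.5 if x == y else 0.0


def _cov(u: list, v: list) -> float:
    """Sample covariance with an (n - 1) denominator."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def delong_paired(scores_a, scores_b, labels):
    """Two-sided DeLong test for two correlated AUROCs; returns (auc_a, auc_b, p)."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    m, n = len(pos), len(neg)
    if m < 2 or n < 2:
        raise ValueError("need at least two positive and two negative cases")
    v10, v01, aucs = [], [], []
    for s in (scores_a, scores_b):
        # Structural components: placement values of each positive and negative case.
        row10 = [sum(_psi(s[i], s[j]) for j in neg) / n for i in pos]
        row01 = [sum(_psi(s[i], s[j]) for i in pos) / m for j in neg]
        v10.append(row10)
        v01.append(row01)
        aucs.append(sum(row10) / m)
    # Var(AUC_a - AUC_b) from the covariance of the structural components.
    var_diff = (
        _cov(v10[0], v10[0]) + _cov(v10[1], v10[1]) - 2 * _cov(v10[0], v10[1])
    ) / m + (
        _cov(v01[0], v01[0]) + _cov(v01[1], v01[1]) - 2 * _cov(v01[0], v01[1])
    ) / n
    if var_diff <= 0:
        raise ValueError("degenerate variance: the AUROCs cannot be compared")
    z = (aucs[0] - aucs[1]) / math.sqrt(var_diff)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return aucs[0], aucs[1], p_value


labels = [1, 1, 1, 1, 0, 0, 0, 0]
model_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.5, 0.2, 0.65]
model_b = [0.6, 0.9, 0.5, 0.8, 0.7, 0.4, 0.55, 0.1]
print(delong_paired(model_a, model_b, labels))  # AUROCs 0.9375 and 0.8125
```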

Discussion

In China, assessing the severity of CHB is a critical step before timely intervention [4]. TE has been widely applied in Chinese hospitals in recent years, despite its high price.

To stage liver fibrosis noninvasively in patients with HBV infection, our study established and validated a machine-learning-based multivariable model incorporating Fibroscan results, serum biomarkers and ultrasonic measurements. In two external validation cohorts, this FibroBox model showed favorable diagnostic performance for predicting significant fibrosis, outperforming TE, APRI and FIB-4. For predicting cirrhosis, FibroBox may perform better than TE, but this requires further validation.

Fibroscan has been reported to perform better than serum biomarker indices in predicting significant fibrosis and cirrhosis in Chinese cohorts [21, 22]. In our study, TE measurements were obtained within a month after liver biopsy. The optimal cut-off values of Fibroscan for significant fibrosis and cirrhosis in both validation sets were 7.8 and 11.3 kPa, respectively, close to those proposed in other countries [23,24,25]. Regardless of cohort and prediction target, all TE AUROCs exceeded 0.8, which is acceptable but leaves room for improvement. Because our study excluded obese patients (BMI ≥30 kg/m²), obesity was ruled out as a cause of unreliable TE results. The spread of Fibroscan is limited by its high cost (€34,000 for a portable device and €5000 for annual maintenance), but its high diagnostic efficiency still makes it recommendable [5, 26]. FibroBox performed better than TE according to AUROC comparisons (Table 3, Fig. 3) and DCA curves (Fig. 4). Although the difference between FibroBox and TE for cirrhosis was not significant, class imbalance may also have affected this validation result: fewer than a quarter of the validation patients were cirrhotic (Anhui: 14.5%; Beijing: 22.3%).

The application of Fibroscan is limited by ascites, and Fibroscan may be less reliable than two-dimensional shear wave elastography (2D-SWE) [27, 28]. However, 2D-SWE is not as widely applied as Fibroscan in China. Therefore, TE was the only elastography measurement used as an input variable in this study. In addition, TE can stage liver fibrosis regardless of its cause (HBV, HCV or nonalcoholic fatty liver disease [NAFLD]), whereas FibroBox focuses only on HBV-induced liver fibrosis; similar studies are needed for other causes of fibrosis.

The predictive accuracy of APRI and FIB-4 observed in this study was unsatisfactory. For predicting significant fibrosis, the AUROC of APRI was 0.66 (0.60 to 0.73) in the Anhui cohort and 0.70 (0.65 to 0.76) in the Beijing cohort; for predicting cirrhosis, it was 0.72 (0.65 to 0.79) and 0.75 (0.67 to 0.82), respectively. APRI therefore performed better in predicting cirrhosis than in predicting significant fibrosis. In the Anhui cohort, the AUROC of FIB-4 for predicting cirrhosis was significantly higher than that of APRI (P = 0.009), suggesting that the predictive performance of FIB-4 may lie between those of APRI and TE. In addition, the optimal cut-off values of APRI and FIB-4 were both determined with the Youden index (sensitivity + specificity − 1), and the optimal APRI cut-off differed considerably from that recommended by the WHO guidelines [29], highlighting the instability and unreliability of guideline-recommended APRI cut-offs for predicting fibrosis in Chinese cohorts.
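Choosing a cut-off with the Youden index amounts to scanning candidate thresholds and keeping the one that maximizes J = sensitivity + specificity − 1. In the sketch below, a patient is called positive when the score is at or above the cut-off (an assumed convention), the first maximum wins any tie, and the function name and toy data are hypothetical.

```python
def youden_cutoff(scores, labels):
    """Return (J, cut-off, sensitivity, specificity) maximizing J = Se + Sp - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both positive and negative cases")
    best = None
    for cut in sorted(set(scores)):
        # Positive call: score >= cut (assumed convention).
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cut and y == 0)
        se, sp = tp / n_pos, tn / n_neg
        j = se + sp - 1
        if best is None or j > best[0]:
            best = (j, cut, se, sp)
    return best


scores = [0.1, 0.2, 0.3, 0.5, 0.4, 0.6, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1]
print(youden_cutoff(scores, labels))  # (0.75, 0.4, 1.0, 0.75)
```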

This study has several limitations. First, the retrospective design limited the robustness of the data. However, the sample size was large and four centers participated, which supports the applicability and reliability of the established models; we also adopted a design with two validation sets, similar to that of Lemoine et al. [25]. Second, inconsistencies between data samples affected model validation. For instance, the proportion of cirrhosis in the Anhui cohort was only 14.5%, so this cohort could not serve as a training set: it contained too few cirrhotic cases to discriminate cirrhosis (F4) from non-cirrhosis (F0–3). Third, FibroBox is complex and involves 10 parameters. However, its cost-effectiveness may still be acceptable, because these 10 input parameters can be obtained through clinical examinations and FibroBox runs in only a few seconds. Finally, several parameters, such as PIIINP, type IV collagen, laminin and HA, are not readily available in clinical laboratories. Easily obtained ratios could be developed instead, similar to the study by Yuan et al. [30]. Future versions of FibroBox should aim for simplification while preserving accuracy.

Conclusions

In conclusion, compared with TE, APRI and FIB-4, FibroBox may be a superior noninvasive fibrosis indicator to predict the fibrosis stage in Chinese patients with CHB. The FibroBox requires further validation in other parts of China or other countries.