Background

Chronic liver diseases and cirrhosis are the 11th leading cause of death in the world, accounting for 1.1 million deaths annually [1]. The global prevalence of cirrhosis rose substantially from 71 million in 1990 to over 122 million in 2017 [2]. Common causes of cirrhosis are chronic hepatitis B virus (HBV) and hepatitis C virus (HCV) infections, alcohol-related liver disease and nonalcoholic steatohepatitis (NASH) [2]. Over the past decade, there has been a temporal shift in the causes of cirrhosis: the prevalence of NASH has been increasing dramatically, whereas the prevalence of other causes has been slowly decreasing [3]. The estimated worldwide prevalence of nonalcoholic fatty liver disease (NAFLD) is 25% [4] and is projected to reach 33.5% by 2030, emphasizing the importance of both cirrhosis and NAFLD [5].

The spectrum of liver fibrosis ranges from minimal fibrosis to full-blown cirrhosis [6]. Patients with early cirrhosis are mostly asymptomatic because the liver is able to compensate. However, without prompt diagnosis and proper treatment, early cirrhosis can quickly deteriorate to decompensated cirrhosis, which eventually leads to complications and mortality. Patients with decompensated cirrhosis have an approximately tenfold higher risk of death than the general population [7]. Therefore, the detection and treatment of early-stage fibrosis and NASH can slow disease progression, reduce the risk of liver cancer and decrease mortality.

The gold standard for the diagnosis and staging of liver fibrosis and NAFLD is liver biopsy. However, liver biopsy is an invasive procedure that can lead to complications such as hemorrhage, biliary peritonitis and pneumothorax [8]. Another drawback of liver biopsy is a high rate of sampling error with interobserver and intraobserver variation in histologic evaluations [6, 9]. Additionally, liver biopsy is not always feasible as a follow-up method for liver diseases. Accordingly, serum markers and imaging modalities have been developed as alternative noninvasive diagnostic methods for liver fibrosis, but they have limited performance, particularly for early-stage fibrosis [8, 10]. For example, the sensitivity and specificity of the aspartate aminotransferase-to-platelet ratio index (APRI) are 69% and 77%, respectively, and those of the Fibrosis-4 (FIB-4) score are 69% and 78%, respectively, for the detection of advanced fibrosis [11]. Various imaging modalities, e.g., magnetic resonance elastography (MRE), have also been used for the diagnosis and classification of liver fibrosis with relatively reliable accuracy [12]. However, the availability of these modalities is limited. The performance of most of these tests needs to be improved.

Since the start of the twenty-first century, there have been significant advancements in artificial intelligence (AI) technology, resulting in applications of AI in several aspects of medicine, particularly in aiding diagnosis. In gastroenterology, AI-assisted systems have been studied in various settings, such as the endoscopic detection and classification of colorectal cancer [13, 14]. In liver diseases, machine learning algorithms have been developed to predict the risk and outcomes of diseases using multiple clinical parameters, e.g., assessing liver fibrosis and steatosis, predicting liver decompensation in primary sclerosing cholangitis, screening and selecting liver transplant recipients, and predicting post-transplant survival and complications [15].

There have been some previous systematic reviews on AI in gastroenterology and liver disease [15, 16]; however, very few meta-analyses have been conducted to evaluate the performance of AI-assisted systems. In this systematic review and meta-analysis, we focused mainly on liver parenchymal diseases, i.e., liver fibrosis and steatosis. The main objective of this study was to assess the performance of AI-integrated noninvasive tests for the diagnosis and staging of liver fibrosis and steatosis.

Methods

The study was conducted based on the Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA) checklist.

Search strategy

We searched for studies on AI in liver fibrosis and steatosis. A literature search was conducted on MEDLINE, Scopus, Web of Science and Google Scholar databases. The search was conducted from the year 2000 through January 2020. We opted to exclude studies published before 2000 because most of these studies utilized obsolete computer-assisted algorithms that are currently no longer used in the modern AI era. Keywords for the search were as follows: “artificial intelligence”, “computer-assisted”, “computer-aided”, “neural network”, “machine learning”, “deep learning”, “liver”, “hepatic”, “parenchyma”, “parenchymal”, “fibrosis”, “cirrhosis”, “steatosis”, “fatty”, “NASH”, and “NAFLD”.

Inclusion and exclusion criteria

We included all articles focusing on the utilization of AI in the diagnosis and/or staging of liver fibrosis and steatosis. The inclusion criteria were as follows: participants in the study underwent liver biopsy as the gold standard for the diagnosis of liver fibrosis and steatosis; the reported results were sufficient for generating 2 × 2 tables; and the articles were written in English. The exclusion criteria were as follows: articles that did not report our desired outcomes of sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV); studies that did not provide sufficient information to calculate true positive (TP), false positive (FP), true negative (TN) and false negative (FN) values; articles that did not clearly report training and test datasets or did not contain information on validation methods; and conference proceedings or abstracts with incomplete information on population, AI methods, and validation methods.
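As an illustration of how the four outcomes relate to a 2 × 2 table, the following minimal Python sketch computes sensitivity, specificity, PPV and NPV from TP, FP, TN and FN counts. The class name and the example counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from one study's 2 x 2 table (index test vs. liver biopsy)."""

    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives

    def sensitivity(self) -> float:
        # Proportion of biopsy-positive patients the test identifies.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Proportion of biopsy-negative patients the test identifies.
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        # Proportion of test-positive patients who are biopsy-positive.
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        # Proportion of test-negative patients who are biopsy-negative.
        return self.tn / (self.tn + self.fn)


# Hypothetical counts, for illustration only.
example = TwoByTwo(tp=40, fp=10, tn=90, fn=10)
print(round(example.sensitivity(), 2))  # 0.8
print(round(example.specificity(), 2))  # 0.9
```

Unlike sensitivity and specificity, the PPV and NPV computed this way depend on the proportion of biopsy-positive patients in the cohort.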

Data extraction and quality assessment

Two authors (PD and TT) independently performed data extraction and quality assessment. Any disagreements were discussed with the third author (RC). Data extracted included the author, publication year, country where the study was conducted, study design, liver diseases/conditions, diagnostic modalities, number of participants, type of AI models, number of samples in the development and validation cohorts, validation method (e.g., k-fold cross validation, independent cohort), sensitivity, specificity, and crude number of TP, FP, TN and FN values. For the studies that developed multiple AI models, we included the AI model that had the best overall performance in the main analysis. Our criterion for the best overall performance was the mean of the sensitivity and specificity, i.e., (sensitivity + specificity)/2 [17]. This criterion was used because we placed equal emphasis on sensitivity and specificity. In the diagnosis of liver fibrosis, especially cirrhosis, we would like a diagnostic test to be sensitive in order to detect liver fibrosis early. However, we would also like to avoid incorrectly diagnosing patients as having liver fibrosis when they actually do not have the condition. Therefore, we opted for models that balance false negatives (reflected in sensitivity) and false positives (reflected in specificity) [17]. Moreover, sensitivity and specificity do not depend on prevalence or incidence in validation cohorts. In studies with multiple AI models, we also extracted the performance of the models with the best sensitivity and the best specificity in order to perform additional sensitivity-focused and specificity-focused analyses.
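The selection rule can be sketched in a few lines of Python. This is only an illustration of the criterion described above: the function names, model names and performance values are hypothetical, and ties are resolved arbitrarily (first model encountered), which the text does not specify.

```python
def balanced_score(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity: (sensitivity + specificity) / 2."""
    return (sensitivity + specificity) / 2


def select_models(models: dict) -> dict:
    """Choose one model per analysis from a study reporting several models.

    `models` maps a model name to a (sensitivity, specificity) pair.
    """
    return {
        # Main analysis: best mean of sensitivity and specificity.
        "main": max(models, key=lambda m: balanced_score(*models[m])),
        # Sensitivity-focused analysis: highest sensitivity.
        "sensitivity_focused": max(models, key=lambda m: models[m][0]),
        # Specificity-focused analysis: highest specificity.
        "specificity_focused": max(models, key=lambda m: models[m][1]),
    }


# Hypothetical values for three models reported by a single study.
candidates = {
    "ann": (0.80, 0.85),
    "svm": (0.90, 0.70),
    "random_forest": (0.70, 0.92),
}
selected = select_models(candidates)
print(selected["main"])  # ann
```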

Quality assessment

The methodological quality of the included studies was evaluated using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool [18]. The QUADAS-2 tool comprises 12 questions regarding 4 domains including patient selection, index test, reference standard, and flow and timing. Some questions were slightly modified to specifically assess studies on AI. For example, in clinical studies on diagnostic tests, prespecified thresholds of the index test should be set prior to data collection and analysis to prevent post-hoc data analysis for the desired results. For AI research, we assessed this issue by identifying whether the developed AI model was validated in another set of cohorts apart from the training cohorts, e.g., test set, or external validation cohorts. Details of the modified QUADAS-2 tool are provided in the Supplemental methods.

Statistical analysis

After data extraction, the TP, FP, TN and FN values, if not available, were calculated using Review Manager version 5.3.5 [19]. All statistical analyses were performed using R software, version 3.6.3, Vienna, Austria [20]. The pooled sensitivity, specificity, PPV, NPV and diagnostic odds ratio (DOR) with 95% confidence intervals (95% CIs) were calculated from the crude number of TP, FP, TN and FN values of each study using a random effects model. The summary receiver operating characteristic (SROC) curve was generated, and the area under the curve (AUC) was calculated to determine the diagnostic accuracy of the AI-assisted system. AUC values of 0.5–0.7, 0.7–0.9, and 0.9–1 indicate low, moderate and high accuracy, respectively [21]. Heterogeneity was assessed using I2 and Cochran’s Q statistics. To determine the source of heterogeneity, subgroup analyses and regression analysis based on diagnostic modalities, population and AI classifiers were performed. Publication bias was assessed with the Deeks’ funnel plot. P values of < 0.05 were considered statistically significant.
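The following Python sketch shows one common way to pool a proportion such as sensitivity under a random effects model and to derive Cochran's Q and I2. It is not the R code used in this study. The DerSimonian-Laird estimator, the logit transformation and the 0.5 continuity correction are assumptions made for illustration, because the text does not state which estimator was used. The function name and example counts are hypothetical.

```python
import math


def pool_proportions(events, totals):
    """Random-effects pooling of proportions on the logit scale.

    events[i] / totals[i] is one study's proportion, e.g. TP / (TP + FN)
    for sensitivity. Returns (pooled, ci_low, ci_high, q, i2_percent).
    Assumes the DerSimonian-Laird estimator of between-study variance.
    """
    y, v = [], []
    for e, n in zip(events, totals):
        # 0.5 continuity correction keeps the logit finite for 0% or 100%.
        a, b = e + 0.5, n - e + 0.5
        y.append(math.log(a / b))   # logit of the study proportion
        v.append(1 / a + 1 / b)     # approximate within-study variance

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1 / vi for vi in v]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1

    # Between-study variance (tau^2) and I2 = max(0, (Q - df) / Q).
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))

    def inv_logit(x):
        return 1 / (1 + math.exp(-x))

    return inv_logit(mu), inv_logit(mu - 1.96 * se), inv_logit(mu + 1.96 * se), q, i2


# Hypothetical true-positive counts and numbers of biopsy-positive patients.
pooled, low, high, q, i2 = pool_proportions([40, 35, 18], [50, 40, 30])
print(f"pooled sensitivity {pooled:.2f} (95% CI {low:.2f}-{high:.2f}), I2 = {i2:.0f}%")
```

When all studies report the same proportion, Q is 0 and I2 is 0%, so the random-effects result coincides with the fixed-effect result.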

Results

Literature search

The search results and process of selecting articles are shown in Fig. 1. After the literature search, a total of 297 articles were identified. Articles were excluded for the following reasons: studies that were duplicated (n = 149), studies that were conducted in animals (n = 10), studies focusing on diseases other than liver parenchymal diseases (n = 11), studies that were not original research, i.e., reviews, editorials (n = 35), studies that were not written in English (n = 6), studies that did not report the desired outcomes or validation population characteristics (n = 2), and studies that did not use liver biopsy as the gold standard (n = 4). Eventually, a total of 80 articles were included in the qualitative analysis and snowballing, of which 19 were included in the quantitative analysis (17 studies on liver fibrosis and 2 studies on NAFLD). There were 12 studies integrating AI with imaging modalities, i.e., ultrasonography [22,23,24,25,26], elastography [27, 28], computed tomography (CT) [29, 30] and magnetic resonance imaging (MRI) [31, 32], to facilitate the diagnosis of liver fibrosis and NAFLD. The other 7 studies developed AI models using clinical and laboratory data, such as the presence of other underlying diseases or ascites, liver chemistry tests, and platelet and white blood cell counts, to predict liver fibrosis stages [33,34,35,36,37,38,39]. Regarding the types of AI, 6 studies used convolutional neural networks (CNNs) [22, 24, 28,29,30, 32], 6 studies used artificial neural networks (ANNs) [25, 26, 35,36,37, 39], 5 studies used multiple AI models [23, 27, 33, 34, 38] and 2 studies used a support vector machine (SVM) [31, 40]. The study characteristics, sensitivity, specificity, prevalence, validation methods and other extracted data from the included studies are shown in Table 1. The methodological assessment by QUADAS-2 is summarized in Additional file 1: Table S1.

Fig. 1

Flow diagram of search methodology and literature selection process

Table 1 Characteristics of included studies

Overall performance of AI in the diagnosis of liver cirrhosis

First, we focused on the performance of AI in diagnosing liver cirrhosis (METAVIR F4). A total of 11 studies were included in this analysis [22, 25, 27,28,29,30, 32, 33, 35, 38, 39]. Five studies developed AI models using CNNs [22, 28,29,30, 32], 3 used ANNs [25, 35, 39], and the other 3 studies developed multiple AI models [27, 33, 38]. Different imaging modalities were also employed as inputs for the AI systems: ultrasound was used in 2 studies [22, 25], elastography in 2 studies [27, 28], CT in 2 studies [29, 30], and MRI in 1 study [32]; 4 studies used multiple clinical and laboratory parameters as AI inputs [33, 35, 38, 39]. The results of the meta-analysis showed that AI-assisted systems were able to diagnose cirrhosis with a pooled sensitivity, specificity, PPV, and NPV of 0.78 (95% CI: 0.71–0.85), 0.89 (95% CI: 0.81–0.94), 0.72 (95% CI: 0.58–0.83) and 0.92 (95% CI: 0.88–0.94), respectively. The pooled DOR was 31.58 (95% CI: 11.84–84.25) (Fig. 2). For the sensitivity-focused analysis of the 11 studies, there was no change in the pooled sensitivity. On the other hand, the pooled specificity increased to 0.94 (95% CI: 0.86–0.97) in the specificity-focused analysis (Additional file 1: Table S2).

Fig. 2

Sensitivity (a), specificity (b), positive predictive value (c), negative predictive value (d) and diagnostic odds ratio (e) of AI-assisted diagnosis of liver cirrhosis (F4) with subgroup analysis according to diagnostic modality (ultrasonography, elastography, computed tomography and clinical data)

Overall performance of AI in the diagnosis of advanced fibrosis (METAVIR ≥ F3) and significant fibrosis (METAVIR ≥ F2)

We identified 10 studies using AI models to diagnose advanced fibrosis (≥ F3) [27,28,29,30, 32,33,34, 37, 38, 40]. Four studies developed CNNs [28,29,30, 32], 1 study developed an ANN [37], 1 study utilized an SVM [40], and the other 4 studies developed multiple AI models [27, 33, 34, 38]. The AI models were integrated into elastography in 2 studies [27, 28], CT images in 2 studies [29, 30], MRI images in 2 studies [32, 40] and clinical and laboratory parameters in the other 4 studies [33, 34, 37, 38]. After combining all studies, AI-assisted analysis systems had a pooled sensitivity, specificity, PPV and NPV of 0.86 (95% CI 0.80–0.90), 0.87 (95% CI 0.80–0.92), 0.85 (95% CI 0.75–0.91), and 0.88 (95% CI 0.82–0.92), respectively, and a DOR of 37.79 (95% CI 16.01–89.19) for the diagnosis of advanced fibrosis. The sensitivity-focused analysis yielded a similar pooled sensitivity, whereas the specificity-focused analysis increased the pooled specificity to 0.89 (95% CI 0.81–0.93) (Additional file 1: Table S2).

There were 8 studies investigating the performance of AI-assisted systems for the diagnosis of significant fibrosis (≥ F2) [22, 23, 27, 28, 30, 32, 36, 38]. Four studies used CNNs as AI models [22, 28, 30, 32], 1 study utilized an ANN [36], and the other 3 studies used multiple AI models [23, 27, 38]. In this group, the AI models were integrated into ultrasonography in 2 studies [22, 23], elastography in 2 studies [27, 28], CT in 1 study [30], MRI in 1 study [32], and clinical and laboratory parameters in 2 studies [36, 38]. We found that the pooled sensitivity, specificity, PPV and NPV were 0.86 (95% CI 0.78–0.92), 0.81 (95% CI 0.77–0.84), 0.88 (95% CI 0.80–0.93) and 0.77 (95% CI 0.58–0.89), respectively, and the DOR was 26.79 (95% CI 14.47–49.62). In the sensitivity-focused analysis, the pooled sensitivity increased to 0.91 (95% CI 0.76–0.97), while the pooled specificity remained the same in the specificity-focused analysis (Additional file 1: Table S2).

Subgroup analysis by diagnostic modality

We observed substantial heterogeneity in the overall performance of AI-assisted diagnosis systems, e.g., I2 was 79%, 95%, 93%, 82% and 93% for the pooled sensitivity, specificity, PPV, NPV and DOR, respectively, for the diagnosis of liver cirrhosis. We conducted additional subgroup analyses by diagnostic modality for each stage of fibrosis (Table 2). As expected, there were statistically significant differences in the pooled sensitivity, specificity, PPV, NPV and DOR among different diagnostic modalities. In most subgroups, the I2 values were markedly decreased.

Table 2 Sensitivity, specificity, positive predictive value, negative predictive value and diagnostic odds ratio of AI-assisted diagnosis of significant liver fibrosis (F2–4), advanced fibrosis (F3–4) and cirrhosis (F4) with subgroup analysis according to diagnostic modality (ultrasonography, elastography, computed tomography, clinical data) and population (at-risk population, general population)

For the diagnosis of cirrhosis, the pooled sensitivity, specificity, PPV, NPV and DOR of different diagnostic modalities were significantly different. The sensitivities were 0.79 (95% CI 0.73–0.84), 0.87 (95% CI 0.50–0.98), 0.84 (95% CI 0.80–0.87), and 0.65 (95% CI 0.58–0.72), and the specificities were 0.93 (95% CI 0.90–0.95), 0.88 (95% CI 0.85–0.91), 0.86 (95% CI 0.43–0.98) and 0.91 (95% CI 0.74–0.97), for ultrasonography, elastography, CT, and clinical and laboratory parameters, respectively (p < 0.01 for both). Significant differences in the PPV, NPV and DOR among AI-assisted systems for the diagnosis of cirrhosis were also found (p = 0.01, < 0.01 and 0.04, respectively) (Table 2). In the subgroup analyses, the heterogeneity of most diagnostic subgroups of cirrhosis was markedly reduced. For example, I2 of the ultrasonography subgroup was 0% for the pooled sensitivity, specificity, PPV, NPV and DOR. Similarly, I2 was 0% for the pooled specificity and NPV of the elastography subgroup, 0% for the pooled sensitivity and NPV of the CT subgroup and 0% for the pooled sensitivity of the clinical parameters subgroup (Table 2, Fig. 2).

For advanced liver fibrosis (≥ F3), we observed smaller differences in diagnostic performance among diagnostic subgroups, and a smaller reduction in I2 values after subgroup analyses, than for cirrhosis. For instance, a statistically significant difference was only detected in the pooled NPV among diagnostic subgroups (p < 0.01) (Table 2, Additional file 1: Fig. S1).

The results of the subgroup analyses of significant liver fibrosis (F2–4) were similar to those of cirrhosis, i.e., there were significant differences in the pooled sensitivity, specificity, NPV and DOR among diagnostic modality groups (p < 0.05), and the heterogeneity assessed by I2 was greatly reduced in several subgroups. The I2 values were 0% for the pooled sensitivity, specificity and PPV in the ultrasonography subgroup, 0% for the pooled sensitivity, specificity and DOR in the elastography subgroup, and 0% for the pooled sensitivity, specificity and NPV in the clinical data subgroup (Table 2, Additional file 1: Fig. S2).

Figure 3 shows the SROC curves of AI-assisted systems for the diagnosis of cirrhosis, advanced fibrosis and significant fibrosis with subgroup analysis by diagnostic modality. The overall AUC values were 0.85, 0.92 and 0.86 for the diagnosis of cirrhosis, advanced fibrosis and significant fibrosis, respectively. AUC values of subgroup analyses of different diagnostic modalities are shown in Table 2.

Fig. 3

SROC curves demonstrating performance of AI-assisted diagnosis of liver cirrhosis (F4) (a), advanced fibrosis (F3–4) (b) and significant liver fibrosis (F2–4) (c) with subgroup analysis according to diagnostic modality (ultrasonography, elastography, computed tomography and clinical data)

Subgroup analysis by study population

We were able to identify 2 population groups in the selected studies. The first group of studies was conducted in a general population without any specific liver disease, while the second group was conducted in an “at-risk” population of individuals who already suffered from chronic liver diseases such as chronic viral hepatitis B and C infections. Therefore, we performed subgroup analyses according to the study population, i.e., the at-risk population and general population. The performance of AI-assisted systems for the diagnosis of F2-F4 fibrosis is summarized in Table 2. In contrast to the aforementioned subgroup analysis, the sensitivity and specificity of AI-assisted diagnostic systems in the at-risk population were similar to those in the general population in all stages of liver fibrosis. The heterogeneity was not dramatically reduced, and the subgroups’ I2 values remained high (70–90%). Additionally, there were no significant differences in diagnostic performance between subgroups (p ≥ 0.05) in almost all stages of liver fibrosis. Therefore, we could infer that different populations are unlikely to have an impact on the performance of AI-assisted systems for diagnosing liver fibrosis. To confirm this finding, we further performed a meta-regression analysis with population as a covariate. The mixed effects model showed no statistically significant results, with p = 0.69, 0.70 and 0.35 for F4, ≥ F3 and ≥ F2 stages, respectively.

Subgroup analysis by AI classifiers

We divided the AI classifiers of the included studies into 2 main subgroups, i.e., neural network and non-neural network. The performance of each subgroup is shown in Additional file 1: Table S3. We found that the performance of the 2 subgroups was relatively similar except for a slightly better sensitivity, specificity, PPV and DOR in the neural network group for the diagnosis of cirrhosis. There was no significant difference between AI classifier subgroups, except for the pooled sensitivity and PPV for the diagnosis of cirrhosis as well as the pooled NPV for the diagnosis of advanced fibrosis. We further stratified neural network-assisted studies by diagnostic modality (ultrasonography, elastography, CT, MRI and clinical data) as well as population (at-risk, general population) (Additional file 1: Table S4). Heterogeneity was reduced after subgrouping by modality. For example, I2 values were 0% for the pooled sensitivity, specificity, PPV, NPV and DOR in the diagnosis of cirrhosis by neural network-assisted ultrasonography and the diagnosis of advanced fibrosis by neural network-assisted clinical parameters. Differences between modalities were also observed in the pooled sensitivity, specificity, NPV and DOR for diagnosing cirrhosis as well as in the specificity, PPV, NPV and DOR for classifying advanced fibrosis, whereas subgrouping by population revealed no significant change in overall performance or heterogeneity.

Overall performance of AI in the diagnosis of nonalcoholic fatty liver disease (NAFLD)

Only 2 studies on the AI-assisted diagnosis of NAFLD had liver biopsy as the gold standard [24, 26]. One used an ANN, and the other one used a CNN as AI models. The pooled sensitivity, specificity, PPV, NPV and DOR were 0.97 (95% CI 0.76–1.00), 0.91 (95% CI 0.78–0.97), 0.95 (95% CI 0.87–0.98), 0.93 (95% CI 0.80–0.98), and 191.52 (95% CI 38.83–944.81), respectively, with I2 of 0% for all (Additional file 1: Table S5).

Publication bias

Deeks’ funnel plots were generated to assess publication bias. The plots were relatively symmetrical, with P values for the slope coefficients of 0.30, 0.21 and 0.35 for the diagnosis of cirrhosis, advanced fibrosis and significant fibrosis, respectively (Additional file 1: Fig. S3), suggesting no significant publication bias.
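For readers unfamiliar with this test, the Python sketch below outlines the quantities behind a Deeks' funnel plot as it is commonly described: the log diagnostic odds ratio of each study is regressed on the inverse square root of its effective sample size (ESS), with ESS as the regression weight. It is an illustration rather than the code used in this study. The function names and counts are hypothetical, the 0.5 correction added to every cell is a simplifying assumption, and the P value for the slope (which requires a t distribution) is omitted.

```python
import math


def deeks_points(tables):
    """Convert 2 x 2 tables into funnel-plot points.

    tables: iterable of (tp, fp, tn, fn) counts, one tuple per study.
    Returns a list of (x, ln_dor, ess) with x = 1 / sqrt(ESS).
    """
    points = []
    for tp, fp, tn, fn in tables:
        # Assumption: add 0.5 to every cell so the odds ratio is always defined.
        a, b, c, d = tp + 0.5, fp + 0.5, fn + 0.5, tn + 0.5
        ln_dor = math.log((a * d) / (b * c))
        n_diseased = tp + fn
        n_healthy = fp + tn
        # Effective sample size penalises unbalanced group sizes.
        ess = 4 * n_diseased * n_healthy / (n_diseased + n_healthy)
        points.append((1 / math.sqrt(ess), ln_dor, ess))
    return points


def weighted_slope(points):
    """Weighted least-squares slope of y on x for (x, y, weight) triples."""
    total_w = sum(w for _, _, w in points)
    x_bar = sum(w * x for x, _, w in points) / total_w
    y_bar = sum(w * y for _, y, w in points) / total_w
    sxx = sum(w * (x - x_bar) ** 2 for x, _, w in points)
    sxy = sum(w * (x - x_bar) * (y - y_bar) for x, y, w in points)
    return sxy / sxx


# Hypothetical studies given as (TP, FP, TN, FN).
studies = [(40, 10, 90, 10), (25, 8, 60, 12), (70, 15, 150, 20), (12, 3, 30, 5)]
slope = weighted_slope(deeks_points(studies))
print(f"slope of ln(DOR) on 1/sqrt(ESS): {slope:.2f}")
```

A slope close to zero corresponds to a symmetrical plot; a slope that differs significantly from zero suggests small-study effects such as publication bias.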

Discussion

In this meta-analysis, AI-assisted models had good performance in the assessment of liver fibrosis and steatosis. Interestingly, for the detection of cirrhosis, AI-assisted imaging-based models had greater sensitivities than AI-assisted clinical-based models, i.e., 0.79–0.87 versus 0.65. By contrast, for the diagnosis of significant fibrosis, clinical-based models had a greater sensitivity (0.96 versus 0.73–0.90) but lower specificity (0.78 versus 0.82–0.87) than imaging-based models. The NPVs of AI-assisted models for detecting advanced liver fibrosis and cirrhosis were approximately 90%, implying that AI-assisted models could help clinicians rule out advanced fibrosis and cirrhosis in patients with negative results without the need for invasive methods such as liver biopsy.

AI-aided systems have some advantages over conventional noninvasive diagnostic tools. Unlike ultrasonography, which is an operator-dependent modality, AI utilizes multiple features from ultrasonographic images as inputs to systematically analyze the images, thus reducing bias in the image interpretation. Moreover, AI-assisted diagnosis systems can potentially be used in both the general population and at-risk population. This was suggested by the results of the meta-regression analysis with population as a covariate and by the similar performance of AI-assisted systems between the 2 populations.

Transient elastography is currently the most commonly used noninvasive tool for staging liver fibrosis. A recent meta-analysis showed that transient elastography had AUCs of 0.84, 0.89, and 0.94 for the diagnosis of ≥ F2, ≥ F3 and F4 stage fibrosis, respectively [41, 42]. Real-time elastography has also been frequently used as an alternative to transient elastography with an AUC of 0.72, 0.86 and 0.69 for the diagnosis of liver cirrhosis, advanced fibrosis and significant fibrosis, respectively [43]. Our meta-analysis showed that AI-assisted elastography had higher AUCs for the diagnosis of all stages of liver fibrosis than real-time elastography. When compared with transient elastography, AI-assisted elastography had a slightly lower AUC for identifying liver cirrhosis, but higher AUCs for classifying advanced fibrosis and significant fibrosis. Interestingly, among the 3 AI-assisted systems, AI-assisted ultrasonography had the best performance (Table 3). This could possibly be due to the difference in types of input data. Studies using AI-assisted ultrasonography incorporated inputs with relatively larger regions of interest (ROIs) and extracted different categories of radiomics, compared to AI-assisted elastography studies. Therefore, AI performance could be affected by the selected inputs. Further studies to specify the most appropriate inputs for each AI classifier are warranted in order to maximize AI performance. Due to the satisfactory performance of AI-assisted ultrasonography, AI has a potential application for staging liver fibrosis in areas where elastography machines are not available. Likewise, the FIB-4 score and APRI score are the most commonly used clinical parameters for predicting liver fibrosis. We found that, in line with the AI-assisted image analysis model, the AI-assisted clinical-based model had a lower AUC value for the diagnosis of stage F4 fibrosis but higher AUC values for the diagnosis of stage ≥ F2 and ≥ F3 fibrosis. Nevertheless, after excluding one study [35] that had a different specific population, focusing only on cirrhosis in NAFLD patients, the AUC value for F4 fibrosis increased dramatically from 0.68 to 0.86, which was better than those of APRI and FIB-4.

Table 3 Sensitivity, specificity and area-under-the-curve (AUC) of AI-assisted ultrasonography, AI-assisted elastography, and AI-assisted clinical data for the diagnosis of liver cirrhosis (F4), advanced fibrosis (F3–4) and significant liver fibrosis (F2–4)

In this meta-analysis, we observed relatively high heterogeneity throughout the study. After performing subgroup analyses categorized by diagnostic modality (ultrasound, elastography, CT, MRI, and clinical data), the heterogeneity was dramatically reduced, i.e., the I2 value was 0% in many subgroups. Moreover, the performance of most subgroups was significantly different, indicating that the types of diagnostic modality had an impact on the performance of AI models. Interestingly, we found that AI-integrated ultrasonography had exceptional performance with a relatively low heterogeneity throughout the analyses. Because ultrasound machines are widely available, this finding suggests that AI-assisted ultrasonography has tremendous potential for being utilized in real clinical practice.

This is one of the first meta-analyses of AI-supported systems in the diagnosis of liver diseases. Apart from publications in medical journals, we also included articles from computer science and engineering journals, resulting in a comprehensive review of AI advancements regarding this topic. To reduce the chance of overestimating the diagnostic performance of AI models, only studies that had a validation cohort or equivalent method for evaluating the performance of the developed AI models were included.

There are some limitations in this review and meta-analysis. First of all, several imaging modalities and AI classifiers were included in the meta-analysis, which contributed to the heterogeneity of the overall analysis. For different AI-assisted imaging modalities, we prespecified subgroup analysis by modality. We also performed subgroup analysis according to AI classifier, i.e., neural networks and non-neural networks (Additional file 1: Table S3). We observed relatively similar performance except for a relatively better performance in the diagnosis of cirrhosis in the neural networks group. Additionally, we performed another subgroup analysis of imaging modalities and population including only studies with a neural network AI classifier (Additional file 1: Table S4). We found that the heterogeneity was decreased. However, because the input modalities and AI-assisted systems were not completely identical among the studies included in the analysis, the pooled diagnostic performance needs to be interpreted with caution. Although there was an acceptable number of studies for meta-analysis, the number of studies of each diagnostic tool was relatively small, given that several modalities are currently used for the assessment of liver fibrosis and steatosis. Therefore, the results of the subgroup analyses of each diagnostic modality need to be interpreted with caution. Furthermore, we selected only studies in which liver biopsy was used as the reference standard; consequently, some studies that demonstrated promising results but did not have liver biopsy to confirm the stage of liver fibrosis or steatosis were excluded. Nine of the 19 studies (47%) were prospective; however, none of the included studies were randomized controlled trials. Only 1 study compared the performance between AI and humans [29]. Interestingly, this study showed that the AI-aided system outperformed humans in staging liver fibrosis in CT images. Most included studies evaluated the performance of the developed AI systems on “internal” validation cohorts, of which the baseline patient characteristics were quite similar to those of the development cohort. Whether these developed AI models can be generalized to other populations in clinical practice needs to be further investigated. Moreover, long-term assessment of AI performance in real clinical settings and studies with direct comparisons between AI and conventional diagnostic methods would be beneficial in investigating real-world positive and negative impacts of the AI-assisted system.

Conclusions

This meta-analysis demonstrates the promising potential of AI systems for aiding the diagnosis and staging of liver fibrosis and NAFLD. Integrating AI into conventional noninvasive tools yields effective diagnostic tools with an optimal balance of sensitivity and specificity. Validation of these AI models in other independent cohorts is warranted before implementing these AI-assisted systems into clinical practice.