Introduction

For several tasks in medical imaging, ML is emerging as a reliable new tool due to its high performance and superior capacity to build complex models for making predictions [1]. More than 220 medical devices using ML have been approved in the USA and Europe [2], a development that has increased steadily since 2014. Today, ML software can be considered a medical device [3].

Computed tomography (CT) imaging plays an essential role in diagnostics and post-treatment follow-up of liver diseases [4]. Applying ML-based tools to CT images has shown promising results [5]. It has been tested theoretically for tasks including identification and segmentation of the liver, lesions, blood vessels, and bile ducts in the liver [6], quantification of liver tissue characteristics [7], evaluation of cancer treatment, and prediction of liver disease [8, 9].

A recently published systematic review and meta-analysis demonstrated the diagnostic accuracy of deep learning (DL) in ophthalmology, respiratory medicine, and breast surgery [10]. In addition, a few limited literature reviews have been published in the subfield of ML applied to liver imaging [11,12,13]. However, the performance and clinical applicability of ML in liver imaging are not comprehensively addressed in the literature.

A search in PROSPERO—a database of prospectively registered systematic reviews in health and social care [14]—did not reveal any forthcoming publication in this rapidly developing field. We, therefore, conducted a systematic review from a clinical perspective.

This review aims to answer the following questions: (1) How is ML applied in CT liver imaging? (2) How well do ML systems perform in liver CT imaging? (3) What are the clinical applications of ML in liver CT imaging?

Due to the length of the article, some important parts are given in the electronic supplementary material.

Methods

This systematic review was conducted in accordance with the guidelines of the “Preferred Reporting Items for Systematic Reviews and Meta-Analyses” extension for diagnostic accuracy studies statement [15]. Selection and retrieval of studies from the literature were done in accordance with the Cochrane Handbook for Systematic Reviews [16]. A search was conducted in Medline, EMBASE, and Web of Science and included studies published between January 1, 2011, and October 31, 2021. The search string consisted of exploded MeSH terms, Emtree terms, and free text to find all studies containing the terms “Artificial intelligence” AND “Computed tomography” AND “liver” (or all possible synonyms of all three) in the title, abstract, or keywords. The exact search string is given in the electronic supplementary material.

When considering study quality, we identified a set of important characteristics, given in the electronic supplementary material. The suggested list is comprehensive, and studies might be quite informative with minimal risk of bias without meeting all requirements [17]. Yet, if a study followed only a few of the characteristics, it was not considered well documented for clinical use.

Results

The search was conducted in two phases, one in October 2020 and one in October 2021. There were 191 studies included for review. The selection process is illustrated in the PRISMA flow diagram in Fig. 1 [18]. The selected studies are summarized in Table 1 and details given in the electronic supplementary material.

Fig. 1
figure 1

PRISMA flow chart. Flow chart of the 191 systematically included studies out of 1334 studies identified in Medline, EMBASE, and Web of Science

Table 1 Description of included studies with details on group assignment, document type (A = article, PP = proceedings paper), type of journal (medical or non-medical), AI method used, test set size, external validation status, comparison of ML to clinicians, and use of publicly available datasets

We encountered studies with 19 different aims. To make comparison and discussion more feasible, we divided these studies into five groups according to study aim: (1) liver segmentation; (2) lesion segmentation; (3) lesion detection; (4) classification of liver or liver lesions; (5) miscellaneous/other. The aims are illustrated in the electronic supplementary material. There is some overlap between the groups because several studies had multiple aims. Detailed characteristics of the included studies are given in the supplementary tables.

Liver segmentation

Liver segmentation was the primary or secondary study aim in eighty-four of the included studies. Of those, fifty-one are journal articles [20, 24, 29,30,31,32,33,34,35, 38,39,40,41, 43,44,45,46,47, 49, 55,56,57,58, 62, 63, 65, 68, 70,71,72,73,74,75,76,77,78,79, 81, 84,85,86,87, 89, 91, 93,94,95, 97, 98, 196, 197], and 33 are proceedings papers [19, 21,22,23, 25, 26, 36, 37, 42, 48, 51, 53, 54, 59,60,61, 64, 66, 67, 69, 80, 82, 83, 88, 90, 92, 96, 99, 100, 103, 198]. Liver segmentation was performed on the CT image as a whole liver, not as clinical segmentation into, e.g., Couinaud segments. Overall, this group of studies has contributed considerably with technically sound methods and has experimented with various subdomains of ML, especially DL.

The quality of many recent studies has improved through the use of external validation, which provides better generalizability. Although direct comparison with human experts is preferred, only eleven studies were found to do so.

This group of studies illustrates how challenging it is to obtain labeled medical data: two-thirds of the studies used publicly available datasets for training or testing their ML models. The most frequently used dataset, from LiTS 2017, included 131 patients in its test set [199].

Many studies attempted transparency in reporting model performance; however, of the eighty-seven studies, only 11 reported their results with a confidence interval or standard error. Thus, further analysis of the results was not feasible for this group.

The DICE score was used in most studies in this group to describe the model’s ability to predict which pixels contain liver. The highest reported DICE score was 0.9851 [41], and the lowest was 0.75 [94]. Other measures of model performance were scattered, including AUC-ROC and accuracy (Table 2). Dong et al reported a DICE of 0.92, an accuracy of 0.9722, and an AUC of 0.96. References of studies in the group are given in Table 3.
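As a point of reference, the DICE score these studies report is defined on two binary masks A and B as 2|A∩B| / (|A| + |B|). A minimal NumPy sketch (the masks below are illustrative toy data, not taken from any included study):

```python
import numpy as np

def dice_score(pred: np.ndarray, truth: np.ndarray) -> float:
    """DICE coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, truth).sum() / denom

# Toy 2D "liver" masks: the prediction overlaps the ground truth partially
truth = np.zeros((8, 8), dtype=bool)
truth[2:6, 2:6] = True          # 16 ground-truth pixels
pred = np.zeros((8, 8), dtype=bool)
pred[3:7, 2:6] = True           # 16 predicted pixels, 12 of them overlapping
print(dice_score(pred, truth))  # 2*12 / (16+16) = 0.75
```

A DICE of 1.0 means the predicted and ground-truth masks coincide exactly; the 0.75 above corresponds to the lowest score reported in this group.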

Table 2 Definition of performance and outcome measures
Table 3 References of studies in each category according to characteristics

Lesion segmentation

This group of studies performed segmentation of liver lesions from CT images with ML. The model’s goal was the highest possible fidelity of the segmented lesions to the ground truth. Sixty studies had lesion segmentation as a primary or secondary study aim. Thirty-six are journal articles [24, 29, 31, 32, 38, 46, 47, 55, 56, 62, 72, 78, 84, 91, 93, 94, 97, 98, 102, 111, 115, 117, 118, 122, 124, 125, 130, 133,134,135, 137, 138, 140, 201], and twenty-four [22, 37, 42, 64, 65, 68, 82, 88, 92, 96, 99, 103, 108, 121, 124, 126,127,128,129, 131, 132, 136, 139, 200] are proceedings papers.

Several models have shown remarkable segmenting ability for lesions larger than 2 cm in diameter, while almost every model still struggles to segment lesions smaller than 1 cm in diameter. However, this is comparable to clinicians in the clinical setting. Another limitation on the models’ ability to predict lesions was the quality of the CT images. Several more recent studies used voxel-wise (3D pixel) classification, which can exploit more of the available information and produce 3D output to improve performance.

External validation and comparison of ML to humans are improving in this group: twenty-six studies used external validation. However, only six studies compared their model with human experts.

More than half of the studies in this group reported performance as a DICE score. The reported scores were skewed across studies, ranging from 0.44 to 0.96; the selection of lesion sizes played a key role in achieving higher performance, i.e., a higher DICE score. Another informative measure, the volume overlap error (VOE), expresses the difference between the predicted and ground-truth volumes, so 0 is the optimal score. Twenty-two studies reported VOE, with a range of 0.01–0.46. Other measures were dispersed across studies, including accuracy, AUC, and precision (PPV). Few studies reported their performance with confidence intervals or standard errors. References of studies in the group are given in Table 3.

Lesion detection

Twenty studies had lesion detection as a primary or secondary study aim. This involves simply detecting whether lesions are present in a CT image. Fifteen of them are proceedings papers [23, 26, 27, 87, 102, 104, 105, 107,108,109,110, 112,113,114, 119], and five are journal articles [101, 106, 111, 115, 202].

Several newer studies detected lesions before segmenting or diagnosing them with ML from CT liver images but did not report the performance of the model’s lesion detection task; thus, this group is smaller.

External validation was reported in only four studies. Most studies acquired their training data from local hospitals, and only eight studies used publicly available datasets. DL was the subdomain of ML chosen for this group.

Performance reporting was more transparent and detailed in newer studies across all groups. In this group, performance was primarily reported as accuracy and precision, but five studies reported only false-positive and true-positive rates [26, 87, 101, 104, 115]. Two studies presented their results with a confidence interval or standard error. It is worth mentioning that the study reporting the best precision performed only internal validation on the relatively small public dataset 3D-IRCADb. References of studies in the group are given in Table 3.

Classification of liver or lesions

Studies in this group classified the type and severity of lesions or tumors, graded hepatocellular carcinoma (HCC), or differentiated between HCC, hemangioma, and metastases. Most studies distinguished only between two categories, such as classifying tumors as either benign or malignant. Forty-seven studies had the classification of liver or liver lesions as a study aim. Thirty-four of them are journal articles [56, 71, 72, 74, 78, 141,142,143,144,145,146, 148,149,150,151,152, 154, 156,157,158,159,160,161, 164,165,166,167,168,169,170,171,172, 202, 203], and thirteen are proceedings papers [27, 64, 65, 68, 75, 82, 119, 147, 153, 155, 162, 163, 204]. For classification of liver or liver lesions, both traditional machine learning models, e.g., support vector machines and random forests, and deep learning models were commonly used.

Nine studies compared their model’s performance directly to one or more clinicians in a competition-based comparison. Only 12 studies used publicly available datasets for validation, and even fewer used them for training purposes.

Accuracy was the measure of choice for presenting performance in this group; thirty-one studies reported accuracy, with a range of 0.76–0.99. Sixteen studies reported AUC, with a range of 0.68–0.97. Precision was reported in fourteen studies, with a range of 0.82–1.00. Note that both Sreeja et al and Romero et al reported a perfect precision of 1.0, which Sreeja et al commented was possible due to the small size of their dataset [153, 155]. Only three studies presented their results with a confidence interval. References of studies in the group are given in Table 3.

Other/miscellaneous

The last and most diverse category we found eligible to compare was miscellaneous, comprising thirty-seven studies in total: twenty-nine journal articles [6, 8, 9, 33, 50, 52, 56, 71, 161, 164, 173,174,175,176,177,178,179, 181,182,183,184, 186,187,188, 191, 194, 195, 205, 206] and eight proceedings papers [27, 180, 185, 189, 190, 192, 193, 207]. The aims of these studies are clinically oriented.

Seven studies performed liver fibrosis staging [33, 173,174,175,176,177,178] according to the “Metavir” or “Fibrosis-4” classification [208, 209]. Four compared algorithm performance with human experts, while two performed external validation. Only two studies used public datasets, and only for liver segmentation; all seven studies used private datasets for fibrosis staging training and validation. Traditional ML methods such as SVM and k-nearest neighbors were used, but in recent studies, CNN-based systems using different classifiers to extract features from the liver image are gaining more attention. Jung et al used liver and spleen volumetric indices to perform pathologic liver fibrosis staging with a CNN [177]. One study compared an ML algorithm with three radiologists’ assessments of liver fibrosis staging, with the ML model achieving the more accurate result [33].

Six studies segmented blood vessels in the liver from CT images, including the portal and hepatic veins [52, 179, 183,184,185, 191]. Twelve studies reported a DICE score, with a range of 0.68–0.98. Four studies reported accuracy, with a range of 0.91–0.98, a mean of 0.96, and a median of 0.97. Five studies stated that they externally validated their models.

Five studies had retrieval of focal liver lesion images as a study aim [50, 186, 187, 192, 206]. These studies showed how models could improve clinical workflow by retrieving similar cases from medical records, including earlier expert opinions.

Two studies, published as journal articles, predicted liver metastases in colorectal cancer patients [8, 9]. They reported AUCs of 0.86 ± 0.01 and 0.747 ± 0.036.

One study focused on the segmentation of bile ducts and stones in the intrahepatic bile ducts (hepatoliths) and reported DICE scores of 0.90 and 0.71 for bile duct and hepatolith segmentation, respectively [6].

Three studies focused on response evaluation after chemotherapy or radio-embolization of malignant liver lesions using texture analysis [161, 181, 182]. They compared texture analysis predictions with survival and serologic response and reported an accuracy of 0.97, sensitivity of 0.93, and specificity of 1.0, after training on sixty-two patients and testing using cross-validation.

Two recent studies predicted liver reserve function using the Child–Pugh classification [164, 189], and Thuring et al compared the results of their ML model with those of clinicians. Accuracy for prediction of Child–Pugh class was 53%; for classification of Child–Pugh A vs. B, accuracy was 78%, sensitivity 81%, specificity 70%, and AUC 0.80. Wang et al preoperatively predicted early recurrence of HCC. One study predicted overall survival of patients with unresectable HCC treated by transarterial chemoembolization [176]; this study also presented a fusion of clinical data with the ML model. References of studies in the group are given in Table 3.

Discussion

We found that ML is applied to liver CT imaging for various clinically oriented aims, covering a broad spectrum of applications.

At least one-third of the studies were documented to perform very accurately on reliable but small data. Unfortunately, performance reporting was seldom adequate due to a lack of detail. To our knowledge, there exists no standardized format for presenting results of machine learning models applied to medical imaging.

Several studies reported models that were close to clinical application. However, clinical validation with thorough documentation of both model and data (training and validation), needed to assess quality and generalizability, was lacking. Evaluating a model by analyzing result parameters alone would be insufficient [210].

Almost all studies that performed segmentation of liver structures from abdominal CT images used deep learning models, mainly the subtype CNN. Open-access datasets and competitions like LiTS 2017 contribute substantially to the development of ML applied to liver imaging, as more than 280 studies report their model performance in a standardized format, and the competition is still ongoing with cumulative comparison. U-Net, a subtype of CNN, is used by many participants and has shown promising results. The distribution of sources of the datasets used by the studies included in this review is illustrated in Fig. 2. The use of complex models targeting complex aims, such as automatic liver fibrosis staging, treatment response evaluation, prediction of the occurrence of liver metastases, and segmentation of liver blood vessels according to traditional anatomical landmarks, e.g., the Couinaud classification, is becoming more common and may herald a maturing process in the field.

Fig. 2
figure 2

Distribution of the datasets used for model training and validation. Publicly available datasets include LiTS 2017, 3D-IRCADb, Sliver 2017, and others, while private datasets were mostly collected from local hospitals

ML systems showed promising results on retrospective data for several tasks related to CT imaging: some segmentation studies reported models with more than 98% ability to predict which pixels or voxels contained liver in abdominal CT scans. Further, several studies reported 95% agreement with ground truth for classification of the liver or liver lesions. In recent years, studies have used ML for predicting the occurrence or treatment effect of liver metastases, for liver vessel segmentation, and for evaluating treatment effect on liver malignancy, with results around 70–80% of ground truth.

Other applications, such as classification of liver fibrosis stage and prediction of benign or malignant lesions, showed promising results and potential for a high impact of ML on future routine clinical practice.

Reporting of model performance should use state-of-the-art visualization methods, e.g., a confusion matrix. For studies of segmentation tasks, measures such as mean surface distance with standard error should be reported to give full transparency of model performance [116]. Sixty-two studies identified in this review had such a breach in the reporting of model performance. This makes it difficult to gain a good overall understanding of the field, especially for clinicians. We encourage readers to assess such results with caution.

Further, reporting of standard errors and confidence intervals was often lacking. We recommend that they be reported by default. This problem was also seen in other applications of ML to medical images, and we concur with the need for reporting standards for medical applications, as stated by Aggarwal et al [10].
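One simple way the recommended confidence intervals can be obtained, even without distributional assumptions, is a nonparametric bootstrap over per-case scores. A minimal sketch (the per-patient DICE scores below are synthetic, for illustration only):

```python
import numpy as np

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-case scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.empty(n_boot)
    for i in range(n_boot):
        # Resample patients with replacement and record the resampled mean
        resample = rng.choice(scores, size=scores.size, replace=True)
        means[i] = resample.mean()
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), lo, hi

# Synthetic per-patient DICE scores for a hypothetical segmentation model
scores = [0.93, 0.95, 0.88, 0.91, 0.97, 0.90, 0.94, 0.89, 0.92, 0.96]
mean, lo, hi = bootstrap_ci(scores)
print(f"mean DICE {mean:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Reporting the interval alongside the point estimate lets readers judge whether an apparent difference between models exceeds the uncertainty of the evaluation, which a single mean score cannot convey.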

Many potential applications of ML in liver CT imaging have been identified through this review. In the miscellaneous group in particular, the aims are clinically derived, while segmentation of the liver and its lesions could be implemented as a diagnostic and treatment-planning tool. Studies in the classification group could support the diagnosis of different lesions, e.g., different types of malignant and benign tumors, or of the severity of liver cirrhosis. Despite the promising performance reported in many studies, clinical applications of ML in liver CT imaging have to pass through the corridor of clinical validation and clinical trials [210].

The main issues identified in the literature were limited access to high-quality data and a lack of clinical validation. External validation is becoming more popular among developers, as illustrated in Fig. 3, but it is insufficient to qualify a tool for medical application. There is an urgent need for a shift in focus towards clinical validation in this field. Scholars should perform feasibility studies in clinical routine and design and carry out prospective studies to validate the performance of ML tools in realistic clinical settings. Developers should seek to collaborate with clinicians in this process. Strengths and weaknesses of the study, as well as future perspectives, are given in the supplementary material.

Fig. 3
figure 3

Bar chart categorized by validation method over time. An increasing trend of external validation from 2011 to 2021 is demonstrated by the dotted line

Conclusion

We found reports of many ML applications to liver CT images in the literature, including automatic liver and lesion segmentation, lesion detection, liver or lesion classification, liver vessel segmentation including bile ducts, fibrosis staging, metastasis prediction, evaluation of chemotherapy as treatment of hepatocellular carcinoma, and retrieval of relevant liver lesions from similar cases. Several were documented to perform very accurately on reliable but small data. Deep learning models and ML classification models were commonly used. However, the presentation of study results is not standardized in the literature. Only some studies were close to reporting sufficient details on clinical relevance, data characteristics and quality, algorithm characteristics and bias, and performance measures on external data to be considered ready for clinical use. Further prospective clinical studies are recommended, and the need for more interactive technological and medical research is evident to achieve a secure clinical use of ML methodology in this field.