Introduction

The delivery of quality healthcare is a primary agenda of every nation. Medical imaging encompasses the techniques and processes used to visualise body parts, tissues, or organs for medical purposes, both diagnostic and therapeutic. With recent advances in Artificial Intelligence (AI) and medical imaging technologies, biomedical image analysis has transformed clinical practice by providing improved insights into human anatomy and disease processes.

Liver disease progression is characterised by histopathological and haemodynamic changes within the hepatic parenchyma, which correlate with signs seen on imaging. Liver fibrosis is the most common outcome of chronic liver injury. Persistent hepatic parenchymal damage activates immune cells and drives synthesis of fibrotic extracellular matrix components, leading to scar formation that impairs cell function [1, 2]. Progressive liver fibrosis can lead to liver cirrhosis and related complications such as portal hypertension [3]. Portal hypertension in turn leads to multiple complications including splenomegaly, ascites, varices, hepatorenal syndrome and hepatic encephalopathy. Furthermore, chronic liver injury eventually leads to hepatocellular carcinoma (HCC), with cirrhosis being the main precursor of HCC [4]. Overall, one-third of cirrhotic patients will develop HCC during their lifetime. Risk factors for chronic liver disease and eventual liver cirrhosis include chronic infection with HBV or HCV, heavy alcohol intake, and metabolic liver disease [5].

According to the World Health Organisation (WHO), HCC is the fourth-leading cause of cancer-related deaths in the world [6]. The prognosis of patients with this tumour remains poor, with a 5-year survival rate of 19% from the time of diagnosis [7]. This is largely because HCC is often diagnosed at an advanced stage, owing to the absence of symptoms in early disease and poor adherence to surveillance in high-risk patients. In contrast, the five-year survival rate for patients whose tumours are detected at an early stage and who receive treatment exceeds 70% [8]. Therefore, early diagnosis and staging of liver diseases play a pivotal role in reducing HCC-related deaths, as well as reducing healthcare costs.

Many computational methods have been developed for the radiological diagnosis of chronic liver disease and HCC. Among the various options, Machine Learning (ML) and Deep Learning (DL) methods have received significant attention due to their outstanding performance on disease diagnosis and prognosis. In this review, we aim to perform a comprehensive analysis of various ML and DL methods for the diagnosis of chronic liver disease and HCC. We first provide an overview of methods in the ML pipeline including pre-processing, feature extraction, and learning algorithms. We then provide an overview of Convolutional Neural Networks (CNNs), which are specialised deep learning algorithms for processing 2D or 3D data. We discuss in detail the application of various methods for liver diseases such as fibrosis, cirrhosis, and HCC. We further outline limitations in current studies and provide research directions that need attention from the scientific community.

Radiological diagnosis of fibrosis and cirrhosis

Ultrasound (US) is typically the first-line radiological study obtained in patients suspected of having cirrhosis because it is readily available, non-invasive, well-tolerated, less expensive than its CT or MRI counterparts, provides real-time image acquisition and display, and does not expose patients to the adverse effects of intravenous contrast or radiation. Changes in tissue composition in a cirrhotic liver can be detected on gray-scale US. The ultrasonographic hallmarks of cirrhosis are a nodular or irregular surface, a coarsened liver edge and increased echogenicity; in advanced disease, the gross liver appears atrophied and multi-nodular (typically with atrophy of the right lobe and hypertrophy of the caudate or left lobes) [9]. A prospective study of 100 patients with suspected cirrhosis who underwent liver biopsy showed that high-resolution US had 91% sensitivity and 94% specificity in detecting cirrhosis [10]. In another similar study, hepatic surface nodularity, especially as detected by a linear probe, was shown to be the most direct sign of advanced fibrosis, with reported sensitivity and specificity of 54% and 95% respectively [11, 12]. However, the disadvantages of US in the diagnosis of cirrhosis include high operator dependency and reduced resolution due to speckle noise and the presence of fat in obese patients [13].

Ultrasonography also detects portal hypertension, a predictive marker of poor outcomes in cirrhosis, with reversed portal flow in decompensated cirrhosis being a poor prognostic marker [14]. Recent studies have shown that HCC incidence increases in parallel with portal pressure. B-mode US signs suggestive of increased portal pressure include increased portal vein diameter, splenomegaly, ascites and the presence of abnormal collateral routes; US is able to detect the onset of these complications early. Doppler US has high specificity and moderate sensitivity for the diagnosis of clinically significant portal hypertension, but is limited in detecting slow blood flow and suffers from a reduced frame rate [15, 16].

Ultrasound-based elastography is a radiological technique used as an alternative to liver biopsy to stage the degree of liver fibrosis. Shear wave elastography and strain elastography are the two main techniques used to evaluate liver stiffness, essentially by measuring the hepatic tissue response after mechanical excitation [17]. The accuracy of elastography has primarily been investigated in patients with chronic HCV and HBV. Overall, it has an estimated sensitivity of 70% and specificity of 85% in diagnosing significant fibrosis (F \(\ge\) 2), and 87% and 91% respectively in cirrhosis (F4). A meta-analysis of 17 studies comprising 7,058 patients has also shown that it can be used to predict complications in chronic liver disease patients, with baseline liver stiffness associated with risk of hepatic decompensation (relative risk [RR] 1.07, 95% CI 1.03–1.11), HCC development (RR 1.11, 95% CI 1.05–1.18), and death (RR 1.22, 95% CI 1.05–1.43) [18]. However, its use can be limited where other factors affect measured liver stiffness, including elevated central venous pressure in patients with severe cardio-respiratory disease, obesity, and anatomic distortion.

Given the high sensitivity of ultrasound diagnosis of cirrhosis, CT or MRI is not typically required for diagnosis. However, [12] did show that early liver parenchymal abnormalities and morphological changes on MRI and CT were predictive of cirrhosis by multivariate analysis (with diagnostic accuracies of 66.0%, 71.9% and 67.9% for US, CT and MRI respectively). Furthermore, physiological parameters measured on multiphase CT have been identified as markers of fibrosis; for example, changes in liver perfusion, arterial fraction and mean transit time of contrast correlate well with the severity of cirrhosis (by Child-Pugh classification) [19, 20]. These techniques are yet to be validated in multi-centre trials and remain investigational at this stage.

Diffusion-weighted and contrast-enhanced MRI are able to quantify fibrosis, as MRI can detect the restricted movement of water that accompanies expansion of the extracellular fluid space in liver fibrosis. [21] showed this had a sensitivity of 85% and specificity of 100% for the diagnosis of cirrhosis, as well as a sensitivity of 89% and specificity of 80% for staging the degree of cirrhosis [22]. However, despite their strengths, the use of CT and MRI is limited in the clinical setting because they subject patients to ionising radiation and intravenous contrast material, can significantly increase the cost of the procedure, and MRI-based techniques in particular are subject to limited availability and technical expertise [23].

Radiological diagnosis of HCC

Focal liver lesions seen on ultrasonography in both cirrhotic and non-cirrhotic patients are concerning for HCC. As per the European Association for the Study of the Liver (EASL) Clinical Practice Guidelines, HCC surveillance in at-risk patients consists of six-monthly abdominal ultrasounds. This interval is based on the expected tumour growth rate supported by observational data, and is therefore not shortened for people at higher risk of HCC. Such surveillance has been shown to reduce disease-related mortality: a meta-analysis of 19 studies found that ultrasound surveillance detected the majority of HCC tumours before they presented clinically, with a pooled sensitivity of 94% [24].

The detection of nodules in cirrhotic patients always warrants further diagnostic contrast-enhanced imaging, because benign and malignant nodules cannot be differentiated on ultrasonographic appearance alone; current guidelines recommend CT or MRI to further characterise lesions 1 cm or greater identified on surveillance US [25]. Moreover, US has a low sensitivity of 63% for detecting early-stage HCC, particularly when the liver echotexture is very coarse in advanced fibrosis; the performance of US in identifying small nodules in these cases is therefore highly dependent on operator expertise, patient factors (for example, obesity), and the quality of the equipment. Reported specificity for detecting HCC with US at any stage is uniformly high at > 90% [24,25,26,27,28]. Ultrasound-based elastography has also been studied for the evaluation of focal liver lesions, but because of its restricted depth of penetration and inability to differentiate between the stiffness of benign and malignant tissue, it is not recommended for this use.

Contrast-enhanced radiological diagnosis of HCC is based on vascular phases (that is, lesion appearance in the late arterial, portal venous, and delayed phases). The typical hallmark of HCC is the combination of hypervascularity in the late arterial phase and washout in the portal venous and/or delayed phases, which reflects the vascular derangement occurring during hepatocarcinogenesis [29]. Both CT and MRI are more sensitive than ultrasound for detecting HCC < 2 cm and are thus more likely to identify candidates for liver transplantation therapy [30]. In most studies, MRI has higher sensitivity than CT for HCC diagnosis, and this varies with HCC size, with MRI performing better on smaller lesions (sensitivity of 48% for CT and 62% for MRI in tumours smaller than 20 mm, vs. 92% and 95% respectively in tumours of 20 mm or larger) [31, 32]. CT or MRI can be considered when patient factors such as obesity, severe parenchymal heterogeneity from advanced cirrhosis, intestinal gas and chest wall deformity prevent adequate US assessment [24].

There is a considerable false positive rate with CT or MRI that triggers further cost-ineffective investigations [24]. These modalities also involve the use of contrast, and repeated surveillance would result in accumulated radiation exposure and higher costs. Therefore, CT and MRI are not recommended for routine surveillance. Contrast-enhanced ultrasound (CEUS) in the delayed phase can be used to detect HCC. A recent meta-analysis reported pooled sensitivity and specificity of CEUS for the diagnosis of HCC of 85% and 91% respectively, values almost comparable with MRI and CT for HCC nodules larger than 2 cm. However, its use in surveillance has not been validated and is therefore currently not recommended. It is important to note that CEUS can only assess one or a limited number of identified nodules, as it cannot image the entire liver during the multiple phases of contrast administration [27, 33]. The diagnostic workflow for HCC is shown in Fig. 1.

Fig. 1

HCC is the most common type of primary liver cancer. Alpha-fetoprotein (AFP) is one of the most widely used biomarkers for HCC screening, diagnosis, and prognosis of liver diseases

Alpha-fetoprotein (AFP) as a tumour marker for HCC has insufficient sensitivity and specificity for tumour detection when used alone. Fluctuating AFP levels in cirrhotic patients may indicate not only HCC development but also exacerbation of underlying liver disease or flares of HBV or HCV infection. Moreover, only a small proportion (10–20%) of early-stage tumours present with elevated AFP. However, when combined with US surveillance, AFP significantly increases the sensitivity for diagnosing early-stage HCC, from 45% with US alone to 63% with US plus AFP [34, 35]. The decision to perform a liver biopsy is made on a case-by-case basis. Generally, biopsy is indicated when the imaging-based diagnosis remains inconclusive but malignancy is considered probable. As per the EASL Clinical Practice Guidelines, in non-cirrhotic patients imaging alone is not considered sufficient, and tissue assessment is required to establish the diagnosis [25]. Typical radiological findings of cirrhosis, portal hypertension, and HCC are provided in Table 1.

Table 1 Typical radiological findings of liver disease

Artificial intelligence, machine learning, and deep learning

Artificial Intelligence (AI) is an umbrella term referring to a suite of technologies in which computer systems are programmed to exhibit complex behaviour that would typically require intelligence in humans or animals [36]. The overarching goal of AI is to enable machines to perform intellectual tasks such as decision making, problem solving, perception and understanding human communication, inspired by human cognitive function.

Machine Learning (ML), a subset of AI, provides systems the ability to automatically learn and improve from experience without being explicitly programmed [37]. The conventional ML pipeline includes the steps of pre-processing, feature extraction, classification, and evaluation. As medical imaging datasets often vary in characteristics such as contrast, resolution, orientation, side-markers, and noise, it is important to apply pre-processing techniques to improve dataset quality. After data cleaning, a relevant region-of-interest (ROI) is selected using fully-automatic segmentation, semi-automatic segmentation, or manual delineation by experts. Next, salient features specific to the pattern of a particular medical condition are extracted in a feature extraction step. Once features are extracted, ML algorithms are applied to map them to the target task, such as classification. Overall, ML allows machines to learn from a set of data and subsequently make predictions on new test data. Applications of ML in medical imaging date back to the early 1980s, when computer-aided detection (CADe) and computer-aided diagnosis (CADx) systems were developed [38]. These CAD(e/x) systems were based on a pre-defined set of explicit parameters, features, or rules developed from expert knowledge. However, a major limitation of classical ML systems is the need for handcrafted feature engineering, which is subjective, requires domain expertise, is time-consuming, and is often brittle.

Deep Learning (DL) [39], a subset of ML, uses multiple layers of neural networks to progressively extract higher-level features from the raw input, overcoming the limitations of hand-crafted feature engineering in classical ML systems. In DL, layers of neural networks are stacked in a hierarchy of increasing complexity and abstraction to obtain high-level representations of the raw data. DL-based models have demonstrated state-of-the-art performance on a variety of tasks in fields such as computer vision, natural language processing, speech, and medical imaging. The success of deep learning is attributed to the availability of large-scale annotated datasets, enhanced computing power with the rise of graphics processing units (GPUs), and novel algorithms and architectures.

In the following subsections, we provide an overview of traditional CAD systems using machine learning and deep learning for diagnosing liver diseases from US images.

Machine learning based CAD systems

Before the rise of deep learning, classical machine learning based Computer-Aided Detection and Diagnosis (CAD(e/x)) systems involved a pipeline of handcrafted feature extraction and a trainable classifier. These ML-enabled CAD(e/x) systems assist radiologists in image interpretation, disease detection, segmentation of regions-of-interest (ROIs) such as tumours, and statistical analysis of the extracted tumour. Beyond diagnostics, CAD systems are integrated into the clinical workflow to triage and prioritise patients based on urgency, in turn maximising operational performance. A typical CAD system follows a standard pipeline consisting of the following four steps:

  1. Image pre-processing: Ultrasound (US) images are often of poor quality due to low contrast and speckle noise introduced during image acquisition. The goal of the pre-processing step is to reduce noise, enhance image quality, and standardise data acquired from multiple sources. Various denoising algorithms such as mean, median, bilateral, and Gaussian filters are applied to remove noise. To delineate certain regions, edges are also enhanced using unsharp masking and frequency-domain methods. To improve contrast, methods such as histogram equalisation and more robust variants such as Contrast Limited Adaptive Histogram Equalisation (CLAHE) are applied. Finally, the dataset is normalised using the mean and standard deviation of pixel values. Table 2 provides an overview of the pre-processing methods used in the reviewed studies to remove noise and improve image quality. Pre-processing is a critical step in obtaining consistent features and robust model performance.

  2. Image segmentation: The goal of segmentation is to define the region of interest (ROI) or volume of interest (VOI) on medical images that contains the given lesion or structure. Given that normal and abnormal anatomical structures alone do not form the complete image, it is important to segment the image into foreground and background so that the ROI can be extracted. ROI selection also reduces computational cost, as computing features from an ROI is more efficient than using the complete image. The reviewed studies apply a variety of segmentation methods, ranging from fully-automatic algorithms and semi-automatic algorithms with a seed provided by an expert, to manual ROI delineation. One popular semi-automatic method is seeded region growing, in which an expert provides an initial seed point and the algorithm automatically finds the contour by growing over the ROI. Fully-automatic algorithms include active contours (snakes).

  3. Feature extraction and selection: The goal of feature extraction is to analyse the selected ROI for characteristics that discriminate disease patterns from normal patterns. It extracts characteristic attributes and generates a set of meaningful descriptors from an image, providing quantitative measurements of the selected ROI that support decision making about the pathology of a structure or tissue. Common visual features include colour, shape, and texture. Since medical images have homogeneous regions with little colour or intensity variation, shape and texture are the more informative features. For US images, the most common characteristics are morphological, gray-level, and texture features, which can be extracted in the spatial or frequency domain. Table 3 provides an overview of the feature extraction algorithms used in most of the articles reviewed in this study. After relevant features are extracted, a subset of the most relevant features is selected using feature selection algorithms. Feature selection reduces dimensionality by removing irrelevant and redundant features, in turn improving classification performance. Common feature selection algorithms in the reviewed studies include Principal Component Analysis (PCA), Analysis of Variance (ANOVA), and Locality Sensitive Discriminant Analysis (LSDA); Table 4 provides an overview.

  4. Classification: Classification is the process of categorising items into pre-selected classes based on a similarity defined by some distance measure. Popular classification methods for CAD of liver diseases include Naive Bayes (NB), K-Nearest Neighbour (KNN), Support Vector Machine (SVM), Artificial Neural Network (ANN), and Discriminant Analysis. Ensemble methods such as Random Forests (RF) are also used to leverage multiple classifiers and improve classification performance. Table 5 provides an overview of the machine learning classifiers used in the reviewed studies. A minimal end-to-end sketch of this four-step pipeline is given after this list.

Table 2 Brief overview of various pre-processing methods to remove noise and enhance the image quality
Table 3 Brief overview of various feature extraction methods
Table 4 Brief overview of various feature selection algorithms
Table 5 Brief overview of various machine learning algorithms

Deep learning

Deep learning (DL) [39] is a subset of machine learning in which artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data. DL uses multiple layers to represent data abstractions and build computational models. DL has shown high levels of performance on complex tasks such as speech recognition [46], machine translation [47], object detection [48], caption generation [49], and visual question answering [50]. Key deep learning algorithms include convolutional neural networks [51], recurrent neural networks [52], and generative adversarial networks [53]. In the following subsection, we provide an overview of the building blocks of a typical convolutional neural network, one of the de-facto deep learning algorithms for processing 2D (image) or 3D (volumetric) data.

Convolutional neural networks

Convolutional Neural Networks (CNNs) are a special type of deep neural network well suited to two-dimensional data such as images and three-dimensional data such as videos. CNNs have been successfully applied to medical imaging problems such as skin cancer classification, arrhythmia detection, fundus image segmentation, thoracic disease detection, and lung segmentation. A CNN consists of multiple stacked layers that use local connections, known as local receptive fields, and weight sharing for better performance and efficiency. A typical CNN architecture consists of the following layers:

  • Convolutional layer: The convolutional layer is the core building block of a CNN and uses the convolution operation in place of general matrix multiplication. Its parameters consist of a set of learnable filters, also known as kernels. The main task of the convolutional layer is to detect features within local regions of the input image that are common throughout the dataset, and to map their appearance to a feature map. The output of each convolutional layer is fed to an activation function to introduce non-linearity; common choices include the Rectified Linear Unit (ReLU) and sigmoid.

  • Sub-sampling (pooling) layer: In CNNs, a convolutional layer is typically followed by a pooling layer, which reduces the spatial size of the input and thus the number of parameters in the network. A pooling layer takes each feature map output by the convolutional layer and down-samples it; in other words, it summarises a region of neurons from the convolutional layer. The most common pooling techniques are max pooling and average pooling: max pooling takes the largest value from each patch of the feature map, whereas average pooling takes the average of each patch.

  • Activation function: The activation function applies a non-linear mapping to neuron outputs, determining which activations are retained and passed forward, and enables the network to solve non-linear problems. Common activation functions include sigmoid, tanh, ReLU, and softmax. ReLU is one of the most widely used activation functions as it mitigates the vanishing gradient problem in deep neural networks.

  • Batch normalisation: Batch normalisation addresses internal covariate shift within feature maps. Internal covariate shift is a change in the distribution of hidden units’ values, which slows down convergence and requires careful parameter initialisation. Batch normalisation normalises the distribution of feature maps to zero mean and unit variance. It also smooths the flow of gradients and acts as a form of regularisation, helping the generalisation power of the network.

  • Dropout: Dropout is a regularisation technique heavily used in convolutional neural networks. In dropout, some units or connections are randomly dropped (skipped) with a certain probability during training. Because of its many connections, a neural network can co-adapt, with units learning joint non-linear relations that fit the training data too closely. Dropout counteracts this co-adaptation by randomly dropping connections or units, preventing the network from overfitting on the training data.

  • Fully connected layer: In fully connected layers, each neuron from the previous layer is connected to every neuron in the next layer, and every value contributes to predicting the class of the test sample. The output of the last fully connected layer is passed through an activation function, generally softmax, which outputs the class scores. Fully connected layers are mostly used at the end of the CNN for the classification task. A minimal sketch assembling these layers is given after this list.
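The following is a minimal sketch, in Python with PyTorch assumed, of a CNN assembling the layers described above: convolution with ReLU, batch normalisation, max pooling, dropout, and a fully connected classification head. The layer sizes, the \(64 \times 64\) input resolution, and the two-class setting (e.g., normal vs. cirrhosis B-mode patches) are illustrative.

```python
import torch
import torch.nn as nn

class SmallLiverCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.BatchNorm2d(16),                          # batch normalisation
            nn.ReLU(),                                   # non-linear activation
            nn.MaxPool2d(2),                             # sub-sampling (pooling)
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),                           # dropout regularisation
            nn.Linear(32 * 16 * 16, n_classes),          # fully connected layer
        )

    def forward(self, x):
        # x: (N, 1, 64, 64) grayscale patches -> (N, n_classes) raw logits;
        # softmax is usually applied inside the cross-entropy loss.
        return self.classifier(self.features(x))

logits = SmallLiverCNN()(torch.randn(4, 1, 64, 64))      # -> shape (4, 2)
```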

Representative convolutional neural networks

The history of convolutional neural networks dates back to LeNet-5 [54], which was proposed for digit recognition; due to the lack of computational resources at the time, its usage was limited. The LeNet-5 model uses tanh as a non-linear activation function followed by a pooling layer and three fully connected layers. It was the AlexNet [51] model that made a major breakthrough by drastically reducing the top-5 error rate on the ImageNet challenge compared to previous shallow networks. Since AlexNet, a series of CNN models have steadily advanced the state of the art on ImageNet. In AlexNet, tanh is replaced by rectified linear units (ReLU), and dropout is used to selectively ignore units and avoid overfitting. To boost predictive performance further, the Visual Geometry Group at Oxford developed VGG-16 [55]; VGG increased memory and computational requirements because of its increased depth of 16 layers of convolution and pooling. To limit memory requirements, various structural or topological decompositions were applied, leading to more powerful models such as GoogLeNet [56] and Residual Networks (ResNet) [57]. GoogLeNet uses an Inception module that computes \(1 \times 1\), \(3 \times 3\) and \(5 \times 5\) filters in parallel, but applies bottleneck \(1 \times 1\) filters to reduce the number of parameters. Further changes were made to the original Inception module by removing the \(5 \times 5\) filters and replacing them with two successive layers of \(3 \times 3\) filters, yielding Inception v2. Szegedy et al. [56] released the Inception v3 model, in which depth, width and the number of features are increased systematically by enlarging the feature maps before each pooling layer. ResNet was the first network with more than 100 layers, using an idea similar to Inception v3: the output of two successive convolutional layers is combined with the input that bypasses those layers, acting very similarly to a Network-in-Network. ResNet further increases predictive performance by leveraging rich combinations of features while keeping computation low. Table 6 summarises some of the most common representative CNN models.
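To illustrate the two structural ideas above, the following hedged sketch (Python with PyTorch assumed; channel counts are illustrative, not GoogLeNet's or ResNet's actual configurations) implements an Inception-style module with \(1 \times 1\) bottlenecks and a basic residual block.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel 1x1, 3x3 and 5x5 convolutions; 1x1 'bottlenecks'
    reduce channel counts before the larger filters."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 16, kernel_size=1)       # 1x1 path
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, 8, kernel_size=1),                  # bottleneck
            nn.Conv2d(8, 16, kernel_size=3, padding=1))          # 3x3 path
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, 8, kernel_size=1),                  # bottleneck
            nn.Conv2d(8, 16, kernel_size=5, padding=2))          # 5x5 path

    def forward(self, x):
        # Concatenate the parallel branches along the channel axis.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x)], dim=1)

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions whose output is summed with the input
    via an identity shortcut (channel counts must match)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1))

    def forward(self, x):
        return torch.relu(self.body(x) + x)   # skip connection

out = InceptionBlock(32)(torch.randn(1, 32, 56, 56))   # -> (1, 48, 56, 56)
out = ResidualBlock(48)(out)                           # shape preserved
```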

Table 6 Representative CNN architectures and their high-level description

CNN training strategies

In this section, we highlight strategies that are helpful in training deep convolutional neural networks and improving their performance.

  • Transfer learning: Transfer learning [59] refers to the ability to share and transfer knowledge from a source task to a target task. Convolutional neural networks learn features in a hierarchical manner, whereby early layers learn generic image features such as edges and corners, whereas later layers learn features specific to the dataset. Given that large-scale annotated datasets are challenging to obtain in the medical domain due to cost and time constraints, transfer learning leverages models pre-trained on large-scale datasets such as ImageNet [60] (see the sketch after this list).

  • Data augmentation: Current state-of-the-art CNNs need large-scale annotated data for supervised training. Given the complexity of CNN models, they easily overfit on small medical imaging datasets. Data augmentation [61] is a technique for generating synthetic data, for example by applying affine transformations such as rotation, scaling, translation, and flipping, or by adding noise. Data augmentation not only increases the effective dataset size during training, but also adds diversity to the data, making the model more robust on unseen data.
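Both strategies are illustrated in the sketch below, assuming PyTorch and a recent torchvision (\(\ge\) 0.13, where the `weights=` argument exists): an ImageNet-pretrained ResNet-18 is frozen except for a new two-class head, and a typical affine augmentation pipeline is defined. All hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Transfer learning: reuse generic pretrained features, retrain the head.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False                   # freeze pretrained layers
model.fc = nn.Linear(model.fc.in_features, 2)     # new task-specific head
optimiser = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# Data augmentation: random affine transformations applied on the fly.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.RandomHorizontalFlip(),
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05),
                            scale=(0.9, 1.1)),
    transforms.ToTensor(),
])
```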

Evaluation measures

For evaluating any model, precision, recall, F1-measure, and accuracy scores are computed from the confusion matrix:

  • True Positive (TP): a patient with cirrhosis is correctly detected as having cirrhosis

  • True Negative (TN): a patient without cirrhosis is correctly detected as not having cirrhosis

  • False Positive (FP): a healthy patient is incorrectly detected as having cirrhosis

  • False Negative (FN): a patient with cirrhosis is incorrectly detected as healthy.

  • Precision is the fraction of positive detections of cirrhosis that are correct.

  • Recall measures the proportion of all truly positive cases that the model correctly classifies. It is also called sensitivity.

  • F1-measure is the harmonic mean of precision and recall.

For a binary classification task, the confusion matrix is a \(2 \times 2\) table reporting four primary quantities: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{TN} + \text{FN}} \qquad (1)$$
$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \qquad (2)$$
$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \qquad (3)$$
$$\text{F-measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (4)$$

The Receiver Operating Characteristic (ROC) curve is a 2D plot of the True Positive Rate (sensitivity) against the False Positive Rate (1 \(-\) specificity). The ROC curve represents the trade-off between sensitivity and specificity. The Area Under the ROC Curve (AUC) measures how well a model can discriminate between patients with liver diseases and healthy individuals.
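For illustration, these measures can be computed directly from Eqs. (1)–(4) or with scikit-learn, as in the sketch below; the toy labels and scores are invented purely for demonstration.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]            # 1 = cirrhosis, 0 = healthy
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]            # hard class predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # positive-class scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                          # 3 3 1 1
print(accuracy_score(y_true, y_pred))          # (TP+TN)/(TP+FP+TN+FN) = 0.75
print(precision_score(y_true, y_pred))         # TP/(TP+FP) = 0.75
print(recall_score(y_true, y_pred))            # TP/(TP+FN) = 0.75 (sensitivity)
print(f1_score(y_true, y_pred))                # harmonic mean = 0.75
print(roc_auc_score(y_true, y_score))          # area under the ROC curve
```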

Contributions

This article aims to characterise diagnosis, staging, and surveillance of liver diseases using medical imaging, machine learning and deep learning techniques through a methodical review of the literature. We seek to answer the following research questions:

  1. Methods \(\longrightarrow\) What AI methods are being applied to the diagnosis and staging of liver diseases using ultrasound imaging?

  2. Datasets \(\longrightarrow\) What are the different sources of publicly available datasets?

  3. Scope \(\longrightarrow\) What types of problems are addressed and solved using AI in diagnosing liver diseases?

  4. Performance \(\longrightarrow\) How well do AI techniques, including machine learning and deep learning, perform in terms of diagnostic accuracy?

To answer these questions and draw our insights, we methodically studied 77 articles from a variety of publication venues, mostly published between January 2010 and December 2021. There have been previous surveys related to liver diseases [62,63,64,65,66,67]. The survey in [62] mostly focused on diffuse liver diseases and covered only conventional CAD systems. [63] focused on radiographic features under different medical imaging modalities for diagnosing liver diseases. Similarly to [62], the surveys in [64, 65] focused on the conventional ML pipeline for diagnosing liver lesions using US imaging. Although [66] provided details about both machine learning and deep learning models for diagnosing liver diseases using US imaging, they did not follow a systematic approach. In Table 7, the current study is compared to the existing surveys. The current study provides a more detailed and systematic treatment of state-of-the-art ML and DL approaches for diagnosing liver diseases. We also provide details about datasets, severity scoring methods, and professional society guidelines. We close the review by discussing the limitations of existing studies and noting future research directions to further improve diagnostic performance, expedite clinical workflow, augment clinicians in their decision making, and reduce healthcare costs.

Table 7 Comparison of review articles related to our survey paper, with their methods and scope

The review is structured as follows:

  • Sect. 7 presents our search strategy in terms of selected databases, inclusion and exclusion criteria, and the keywords forming the search query.

  • Sect. 8 provides a systematic review of diagnosing liver diseases using ultrasound imaging.

  • Sect. 9 provides an overview of various public datasets for the diagnosis of liver diseases.

  • Sect. 10 provides current limitations and future research directions.

Search strategy

We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [68] to perform our review.

Data sources and search queries

We conducted a comprehensive search to identify all potentially relevant publications on the application of AI, including machine learning and deep learning, to the diagnosis of liver diseases using medical imaging. Web of Science, Scopus, IEEE Xplore, and the ACM Digital Library were queried for articles indexed from January 2010 to 31 December 2021. We included articles written in English and excluded those in the form of editorials, errata, letters, notes, or comments. Our inclusion and exclusion criteria are provided in Table 8. We first identified keywords and their associations to form our search query. For ease of searching, we divided our keywords across four main concepts. The first concept covers keywords related to liver diseases such as chronic liver disease(s), acute liver disease(s), liver lesion(s), nonalcoholic fatty liver disease, and hepatocellular carcinoma; the first row of Table 9 shows all keywords relevant to this concept. The second concept relates to tasks such as classification, detection, segmentation, and staging of liver diseases (second row of Table 9). The third concept relates to the imaging modalities by which liver diseases are diagnosed, including ultrasound, contrast-enhanced ultrasound, computed tomography, and magnetic resonance imaging (third row of Table 9, with abbreviations). The fourth concept covers keywords related to computer applications such as computer-aided diagnosis, machine learning, and deep learning. For each concept, we included the associated keywords as well as their abbreviations to make the search criteria complete. The final query is the logical AND of the four concepts. A complete picture of the concepts, the keywords related to each concept, and the final search query on the databases is presented in Table 9.

Table 8 List of inclusion and exclusion criteria
Table 9 Search query related to four main concepts which are combined to formulate final query for four databases

Article selection

The search query retrieved a total of 1,878 studies (Web of Science: 455, Scopus: 543, IEEE Xplore: 786, ACM Digital Library: 94). We created an EndNote library for our screening process. We first used the “Find Duplicates” function in EndNote, which highlighted 356 duplicate studies that we removed, leaving a total of 1,700 studies. Given the limited functionality of EndNote’s built-in duplicate detection, we then removed further duplicates manually, leaving a total of 1,494 studies. Our first screening step was based on article title and abstract: many studies targeted other human organs while using one of the relevant imaging modalities, and after this screening we were left with 608 relevant studies. In the second screening step, we read the full texts of the articles and excluded studies on animals (3), other topics (30), review papers (42), and studies focusing on other specific medical conditions (9), leaving a total of 524 studies. In the third screening step, we separated studies by imaging modality: of the 524 studies, 243 used CT, 81 used MRI, and 147 used US, while 53 used liver biopsy, blood biomarkers, urine biomarkers, or diffraction-enhanced imaging for diagnosing liver diseases. In the final stage of our screening, we reviewed the 147 liver US studies and assessed their quality using the questions in our quality assessment, given in Table 10. After this final quality assessment, we were left with a total of 62 studies. During our detailed review, we also went through the references of these 62 studies and found a few relevant studies that our search query had not retrieved, for a total of 77 studies. We reviewed the 77 studies in detail and provide a detailed discussion of their applications, methodology and results. In Fig. 2, the flowchart of our complete article selection process following the PRISMA guidelines is shown. Finally, we conducted a methodical review and qualitative analysis of the 77 studies in accordance with our inclusion and exclusion criteria. Article selection was performed by the first author (S.S.) and agreed upon by all other authors of this article.

Due to the heterogeneity and multidisciplinary nature of the included studies, a formal meta-analysis was not possible. We did, however, visually summarise overall performance using different performance metrics, including sensitivity, specificity, F1-score, and receiver operating characteristic (ROC) curves, which are presented later.

Table 10 Quality assessment questions
Fig. 2

PRISMA flowchart for including articles in our study. The flow diagram depicts the flow of information through the different phases of the methodical review. It maps out the number of studies identified, included and excluded, and the reasons for exclusions

Review of studies on the diagnosis and staging of liver diseases

In this section, we review the selected studies by disease of interest. In Table 11, a summary of studies on fibrosis classification or staging is provided. Most of these studies extracted textural features and applied conventional machine learning algorithms for classification. A few studies [69, 70] fused multiple ultrasound modalities to improve diagnostic performance on fibrosis staging. In Table 12, a summary of studies on cirrhosis classification is provided; the focus is on separating normal cases from cirrhosis, with studies applying both conventional machine learning and deep learning methods. Table 13 provides a summary of studies on nonalcoholic fatty liver disease diagnosis, most of which applied a combination of texture feature extraction algorithms and conventional ML classifiers. Table 14 provides a summary of the classification of various chronic liver diseases. In this setup, studies focus on three-class classification (Normal vs. NAFLD vs. Cirrhosis) or four-class classification (Normal vs. Steatosis vs. Fibrosis vs. Cirrhosis). One study [71] performed six-class classification with the labels normal, steatosis, chronic hepatitis without cirrhosis, compensated cirrhosis, decompensated cirrhosis, and HCC. In Fig. 3, the number of studies per year is shown; the plot shows an increasing trend over the years. In Fig. 4, we show the number of studies applying ML or DL methods for diagnosing liver diseases; machine learning was the de-facto choice before 2017, but with the rise of deep learning there has been a sharp increase in the number of studies using deep learning methods. In Fig. 5, the distribution of studies by application is given. Of the 77 studies covered in this review, the distribution is as follows: fibrosis classification (n=10), cirrhosis classification (n=7), NAFLD (n=22), CLD (n=8), FLL (n=24), and HCC diagnosis and prognosis (n=6). Most of these studies address the classification problem, considering one disease versus the rest.

The results of the studies in Table 14 show that as the number of classes increases, model performance degrades. This is due to the correlation between diseases and their overlapping biomarkers. In Table 15, a summary of studies focusing on different liver lesions and hepatocellular carcinoma is presented. A major observation is that most of these studies used small private (in-house) datasets containing only a few samples of liver lesions such as hemangioma (HEM) or metastases (MET). These studies report quite high classification performance in terms of accuracy scores; however, accuracy is not a robust metric, especially when working with imbalanced datasets. Finally, Table 16 provides a summary of studies focusing on HCC prognosis, most of which centred on patient survival analysis. A few recent studies showed that CNNs outperform classical ML algorithms for HCC diagnosis and prognosis.

Fig. 3

Number of studies per year. In total there were 77 selected studies that met our selection criteria

Fig. 4

Number of studies applying ML (Machine Learning) and DL (Deep Learning) methods

Fig. 5

Distribution of studies by application. NAFLD classification (n=22/28.57%), CLD classification (n=8/10.39%), Fibrosis classification (n=10/12.99%), Cirrhosis classification (n=7/9.09%), FLL classification (n=24/31.17%), and HCC prediction (n=6/7.79%)

Table 11 Summary of studies on Fibrosis classification or staging using liver US images
Table 12 Summary of studies on Cirrhosis classification using liver US images
Table 13 Summary of studies on NAFLD classification using liver US images
Table 14 Summary of studies on CLD classification using liver US images
Table 15 Summary of studies on Focal Liver Lesions (FLLs) classification using liver US images
Table 16 Summary of studies on HCC diagnosis and prognosis using liver US images

Public datasets and online initiatives for the diagnosis of liver diseases

Most of the studies reviewed in this article use private (in-house) datasets. Recently, the research community has recognised the need to release benchmark datasets into the public domain so that computational methods can be compared fairly. The ImageNet dataset [60] has been one of the underlying factors in the success of deep learning in computer vision, because it enabled targeted progress and objective comparison of methods proposed by the community around the world. With the same motivation, researchers in biomedical image computing have started sharing their curated datasets publicly to advance research and foster fair comparison of methods. We provide an overview of various liver datasets below:

  • B-mode fatty liver ultrasound: [103] released a B-mode US dataset for NAFLD steatosis assessment. It contains 550 B-mode ultrasound scans with corresponding liver biopsy results, collected from 55 subjects admitted for bariatric surgery in the Department of Internal Medicine, Hypertension and Vascular Diseases, Medical University of Warsaw, Poland.

  • SYSU-CEUS: The SYSU-CEUS dataset [157] contains 353 CEUS videos of three types of focal liver lesions: 186 instances of hepatocellular carcinoma (HCC), 109 instances of hemangioma (HEM), and 58 instances of focal nodular hyperplasia (FNH). Datasets specific to liver tumours have also been made available through online challenges, as follows.

  • LiTS: The Liver Tumor Segmentation Challenge (LiTS) [158] dataset provides 201 contrast-enhanced 3D abdominal CT scans with segmentation labels for the liver and tumour regions. Each slice of a volume has a resolution of \(512 \times 512\) pixels. Of the 201 volumes, 131 carry annotations, whereas no ground-truth labels are provided for the test set of 70 volumes. The in-plane resolution ranges from 0.60 mm to 0.98 mm, and the slice spacing from 0.45 mm to 5.0 mm.

  • SLIVER07: The Segmentation of the Liver 2007 (SLIVER07) [159] dataset is part of the grand challenge organised in conjunction with MICCAI 2007 for liver tumour segmentation. The training data consist of 10 tumours from 4 patients with their ground-truth segmentations. For the test set, consisting of 10 tumours from 6 patients, the ground truth was not made publicly available by the organisers. The dataset contains liver tumour CT images corresponding to the portal phase of a standard four-phase contrast-enhanced imaging protocol.

  • 3D-IRCADb: The 3D Image Reconstruction for Comparison of Algorithm Database (3D-IRCADb) consists of 3D CT scans of 10 men and 10 women, with liver tumours present in 15 of the cases. The anonymised patient images, labelled images corresponding to the segmented ROIs, and mask images are provided in DICOM format. The in-plane resolution ranges from 0.57 mm to 0.87 mm, and the slice spacing from 1.6 mm to 4.0 mm.

  • CHAOS: The Combined (CT-MR) Healthy Abdominal Organ Segmentation Challenge (CHAOS) [160] is an IEEE ISBI 2019 challenge dataset focused on segmentation of healthy abdominal organs from CT and/or MRI. The CHAOS dataset contains abdominal CT scans of 40 subjects with healthy livers. Each slice has a resolution of \(512 \times 512\) pixels.

  • Multi-organ abdominal CT reference standard segmentation: The Multi-organ Abdominal CT Reference Standard Segmentation dataset [161] comprises 90 abdominal CT images delineating multiple organs, including the spleen, left kidney, gallbladder, esophagus, liver, stomach, pancreas, and duodenum. The abdominal CT images and some of the reference segmentations come from two datasets: The Cancer Imaging Archive (TCIA) Pancreas-CT dataset [162] and the Beyond the Cranial Vault (BTCV) abdominal dataset [163]. The segmentation of the organs across these CT volumes was performed by two experienced undergraduate students and verified by a radiologist on a volumetric basis.

  • DeepLesion: The DeepLesion dataset [164], released by the National Institutes of Health (NIH), consists of more than 32,000 annotated lesions identified on CT images collected from 4,400 unique patients. Each 2D CT scan is annotated with lesion type, bounding box, and metadata. Each image has a resolution of \(512 \times 512\) pixels.

  • MIDAS: The MIDAS liver tumour dataset, from the National Library of Medicine (NLM)’s Imaging Methods Assessment and Reporting project, provides 4 liver tumours from 4 patients, each with five expert hand segmentations. The dataset was made available by Dr. Kevin Cleary at Imaging Science and Information Systems, Georgetown University Medical Center.

  • CLUST: The Challenge on Liver Ultrasound Tracking (CLUST) [165] provides a dataset for automatic tracking of the liver in ultrasound sequences. The dataset consists of 86 independent studies: 64 2D+t and 22 3D+t sequences. The data were split into a training set (40% of all sequences) and a test set (60%). Annotations were provided for the training set, but no ground truth for the test set.

Limitations and future directions

In this section, we outline limitations of existing studies based on our extensive literature review. We also propose future research directions to overcome these limitations.

Limitations

  • Focus on classification: Most studies focus on classification tasks, i.e., binary classification such as Normal vs. Fatty, Normal vs. Fibrosis, or Normal vs. Cirrhosis, or multi-class classification such as Normal vs. Fibrosis vs. Cirrhosis, or Normal vs. Hepatocellular Carcinoma vs. Metastasis vs. Hemangioma. However, very little work has been done on disease progression and severity scoring.

  • Small in-house datasets: Although there has been substantial work on diagnosing liver diseases using imaging modalities such as ultrasound, CT, and MRI, only a few datasets are publicly available. Most studies in our literature review used in-house data of small size, and these datasets often suffer from class imbalance. As the accuracy score is not a reliable metric on imbalanced datasets, the lack of publicly available benchmark datasets limits the true assessment of the algorithms proposed in these studies.

  • Classical CAD systems still prevalent: Deep learning has shown tremendous performance improvements across fields such as computer vision, natural language processing, robotics, and biomedical image processing. In biomedical image computing specifically, deep learning has shown superior results on tasks such as classification, segmentation, and tracking. However, due to the lack of publicly available large-scale annotated liver US datasets, the classical machine learning pipeline remains prominent in the community.

Future research directions

  • Need for a multidisciplinary approach: The management of HCC encompasses multiple disciplines, including hepatologists, diagnostic radiologists, pathologists, transplant surgeons, surgical oncologists, interventional radiologists, nurses, and palliative care professionals [166]. A study by [167] showed that the development of a true multidisciplinary clinic with dedicated tumour board review for HCC patients increased survival, owing to improved staging and diagnostic accuracy, more efficient treatment times, and increased adherence to clinical diagnostic and therapeutic guidelines. Therefore, the AASLD recommends referring HCC patients to a centre with a multidisciplinary clinic.

  • Make use of multi-modal data: Training current state-of-the-art deep learning models on multi-modal data, such as B-mode, Doppler, contrast-enhanced ultrasound, and SWE images, could improve the early staging and diagnosis of HCC, as multi-modal data provide complementary information that helps models improve.

  • Need for benchmark datasets: To push the community’s effort to improve diagnostic performance through novel methods, a benchmark environment should be established by releasing a large-scale annotated dataset into the public domain. As in existing benchmark challenges, task organisers can release annotated training and validation data while withholding test labels; once participants have fine-tuned their methods, they can submit test-set predictions to the challenge evaluation server.

Conclusion

HCC-related morbidity and mortality continue to trend upward due to delays in diagnosis and treatment, as early disease is often asymptomatic. Ultrasound is the recommended first-line imaging modality for diagnosing chronic liver disease and screening for HCC; however, contrast-enhanced studies are required to confirm an HCC diagnosis. In this paper, we first provided an overview of current diagnostic methods for the stages of liver disease. We then laid out the foundations of the classical machine learning pipeline, including image pre-processing, feature extraction, and classification, and gave a brief overview of convolutional neural networks, the specialised deep learning algorithms for processing 2D or 3D data. We then reviewed the use of these methods as diagnostic tools in chronic liver disease and HCC, and discussed the reviewed studies. Finally, we provided future research directions for improving diagnostic accuracy and efficiency in the clinical workflow. We believe that by adopting AI technologies in medical radiology, diagnostic imaging tools have the potential to be implemented in the first-line management of chronic liver disease and HCC.