Radiological Diagnosis of Chronic Liver Disease and Hepatocellular Carcinoma: A Review

Medical image analysis plays a pivotal role in the evaluation of diseases, including screening, surveillance, diagnosis, and prognosis. The liver is one of the major organs responsible for key functions of metabolism, protein and hormone synthesis, detoxification, and waste excretion. Patients with advanced liver disease and Hepatocellular Carcinoma (HCC) are often asymptomatic in the early stages; however, delays in diagnosis and treatment can lead to increased rates of decompensated liver disease, late-stage HCC, morbidity, and mortality. Ultrasound (US) is a commonly used imaging modality for the diagnosis of chronic liver diseases, including fibrosis, cirrhosis, and portal hypertension. In this paper, we first provide an overview of diagnostic methods for the stages of liver disease and discuss the role of Computer-Aided Diagnosis (CAD) systems in diagnosing liver diseases. Second, we review the utility of machine learning and deep learning approaches as diagnostic tools. Finally, we present the limitations of existing studies and outline future directions to further improve diagnostic accuracy, reduce cost and subjectivity, and improve clinical workflow.


Introduction
The delivery of quality healthcare is one of the primary agendas of every nation. Medical imaging comprises techniques and processes designed to visualise body parts, tissues, or organs for diagnostic and therapeutic purposes. With recent advances in Artificial Intelligence (AI) and medical imaging technologies, biomedical image analysis has transformed clinical practice by providing improved insights into human anatomy and disease processes.
Liver disease progression is characterised by histopathological and haemodynamic changes within the hepatic parenchyma, which correlate with signs found on imaging. Liver fibrosis is the most common outcome of chronic liver injury. Persistent hepatic parenchymal damage results in activation of immune cells and synthesis of fibrotic extracellular matrix components, leading to scar formation that impairs cell function [1,2]. Progressive liver fibrosis can lead to liver cirrhosis and related complications such as portal hypertension [3]. Portal hypertension in turn leads to multiple complications, including splenomegaly, ascites, varices, hepatorenal syndrome, and hepatic encephalopathy. Further, chronic liver injury eventually leads to hepatocellular carcinoma (HCC), with cirrhosis being the main precursor of HCC [4]. Overall, one-third of cirrhotic patients will develop HCC during their lifetime. Risk factors for chronic liver disease, and eventually liver cirrhosis, include chronic infection with HBV or HCV, heavy alcohol intake, and metabolic liver disease [5].
According to the World Health Organisation (WHO), HCC is the fourth-leading cause of cancer-related deaths in the world [6]. The prognosis of patients with this tumour remains poor, with a 5-year survival rate of 19% from the time of diagnosis [7]. This is largely because HCC is often diagnosed at an advanced stage, owing to the absence of symptoms in early disease and poor adherence to surveillance in high-risk patients. By contrast, the five-year survival rate for patients whose tumours are detected at an early stage and who receive treatment exceeds 70% [8]. Early diagnosis and staging of liver diseases therefore play a pivotal role in reducing HCC-related deaths as well as healthcare costs.
Many computational methods have been developed for the radiological diagnosis of chronic liver disease and HCC. Among the various options, Machine Learning (ML) and Deep Learning (DL) methods have received significant attention due to their outstanding performance in disease diagnosis and prognosis. In this review, we aim to perform a comprehensive analysis of various ML and DL methods for the diagnosis of chronic liver disease and HCC. We first provide an overview of methods in the ML pipeline, including pre-processing, feature extraction, and learning algorithms. We then provide an overview of Convolutional Neural Networks (CNNs), which are specialised deep learning algorithms for processing 2D or 3D data. We discuss in detail the application of various methods to liver diseases such as fibrosis, cirrhosis, and HCC. We further outline limitations of current studies and provide research directions that need attention from the scientific community.

Radiological diagnosis of fibrosis and cirrhosis
Ultrasound (US) is typically the first-line radiological study obtained in patients suspected of having cirrhosis because it is readily available, non-invasive, well-tolerated, less expensive than CT or MRI, provides real-time image acquisition and display, and does not expose patients to the adverse effects of intravenous contrast or radiation. Changes in tissue composition in a cirrhotic liver can be detected on gray-scale US. The ultrasonographic hallmarks of cirrhosis are a nodular or irregular surface, a coarsened liver edge, and increased echogenicity; in advanced disease, the gross liver appears atrophied and multi-nodular (typically with atrophy of the right lobe and hypertrophy of the caudate or left lobes) [9]. A prospective study of 100 patients with suspected cirrhosis who underwent liver biopsy showed that high-resolution US had 91% sensitivity and 94% specificity in detecting cirrhosis [10]. In another similar study, hepatic surface nodularity, especially when detected by a linear probe, was shown to be the most direct sign of advanced fibrosis, with reported sensitivity and specificity of 54% and 95%, respectively [11,12]. However, the disadvantages of US in the diagnosis of cirrhosis include high operator dependency and reduced resolution due to speckle noise and, in obese patients, overlying fat [13].
Ultrasonography also detects portal hypertension, a predictive marker of poor outcomes in cirrhosis, with reversed portal flow in decompensated cirrhosis being a poor prognostic sign [14]. Recent studies have shown that HCC incidence increases in parallel with portal pressure. B-mode US signs suggestive of increased portal pressure include increased portal vein diameter, splenomegaly, ascites, and the presence of abnormal collateral routes; US is able to detect the onset of these complications early. Doppler US has high specificity and moderate sensitivity for the diagnosis of clinically significant portal hypertension, but is limited in detecting slow blood flow and has a reduced frame rate [15,16].
Ultrasound-based elastography is a radiological technique used as an alternative to liver biopsy to stage the degree of liver fibrosis. Shear wave elastography and strain elastography are the two main techniques used to evaluate liver stiffness, essentially by measuring the hepatic tissue response after mechanical excitation [17]. The accuracy of elastography has primarily been investigated in patients with chronic HCV and HBV. Overall, it has an estimated sensitivity of 70% and specificity of 85% in diagnosing significant fibrosis (F ≥ 2), and 87% and 91%, respectively, in cirrhosis (F4). A meta-analysis of 17 studies comprising 7,058 patients has also shown that it can predict complications in chronic liver disease, with baseline liver stiffness associated with the risk of hepatic decompensation (relative risk [RR] 1.07, 95% CI 1.03-1.11), HCC development (RR 1.11, 95% CI 1.05-1.18), and death (RR 1.22, 95% CI 1.05-1.43) [18]. However, its use can be limited where other factors affect measured liver stiffness, including elevated central venous pressure in patients with severe cardio-respiratory disease, obesity, and anatomic distortion.
Given the high sensitivity of ultrasound for the diagnosis of cirrhosis, CT or MRI is not typically required. However, [12] did show that early liver parenchymal abnormalities and morphological changes on MRI and CT were predictive of cirrhosis by multivariate analysis (diagnostic accuracy of 66.0%, 71.9%, and 67.9% for US, CT, and MRI, respectively). Furthermore, physiological parameters measured on multiphase CT have been identified as markers of fibrosis; for example, changes in liver perfusion, arterial fraction, and mean transit time of contrast correlate well with the severity of cirrhosis (by Child-Pugh classification) [19,20]. These techniques are yet to be validated in multi-centre trials and remain investigational at this stage.
Diffusion-weighted and contrast-enhanced MRI are able to quantify fibrosis, as MRI can detect the restricted movement of water that accompanies expansion of the extracellular fluid space in liver fibrosis. [21] showed this had a sensitivity of 85% and specificity of 100% for the diagnosis of cirrhosis, as well as a sensitivity of 89% and specificity of 80% for staging the degree of cirrhosis [22]. However, despite their strengths, the use of CT and MRI is limited in the clinical setting because they subject patients to ionising radiation and intravenous contrast material, can significantly increase the cost of the procedure, and MRI-based techniques in particular suffer from limited availability and the degree of technical expertise required [23].

Radiological diagnosis of HCC
Focal liver lesions seen on ultrasonography in both cirrhotic and non-cirrhotic patients are concerning for HCC. As per the European Association for the Study of the Liver (EASL) Clinical Practice Guidelines, HCC surveillance in at-risk patients consists of six-monthly abdominal ultrasounds. This interval is based on the expected tumour growth rate supported by observational data, and is therefore not shortened for people at higher risk of HCC. Surveillance has been shown to reduce disease-related mortality: a meta-analysis including 19 studies found that ultrasound surveillance detected the majority of HCC tumours before they presented clinically, with a pooled sensitivity of 94% [24].
The detection of nodules in cirrhotic patients always warrants further diagnostic contrast-enhanced imaging, because benign and malignant nodules cannot be differentiated on ultrasonographic appearance alone; current guidelines recommend CT or MRI to further characterise lesions 1 cm or greater identified on surveillance US [25]. Moreover, US has a low sensitivity of 63% for detecting early-stage HCC, particularly in the setting of very coarse liver echotexture in advanced fibrosis; the performance of US in identifying small nodules in these cases is therefore highly dependent on operator expertise, patient factors (for example, obesity), and the quality of equipment. Reported specificity for detecting HCC with US at any stage is uniformly high at > 90% [24-28]. Ultrasound-based elastography has also been studied for the evaluation of focal liver lesions, but because of its restricted depth of penetration and inability to differentiate between the stiffness of benign and malignant tissue, it is not recommended for this use.
Contrast-enhanced radiological diagnosis of HCC is based on vascular phases (that is, lesion appearance in the late arterial phase, portal venous phase, and delayed phase). The typical hallmark of HCC is the combination of hypervascularity in the late arterial phase and washout in the portal venous and/or delayed phases, reflecting the vascular derangement that occurs during hepatocarcinogenesis [29]. Both CT and MRI are more sensitive than ultrasound for detecting HCC < 2 cm and are thus more likely to identify candidates for liver transplantation [30]. In most studies, MRI has higher sensitivity than CT for HCC diagnosis, which varies with HCC size; MRI performs better on smaller lesions (sensitivity of 48% for CT vs. 62% for MRI in tumours smaller than 20 mm, compared with 92% and 95%, respectively, in tumours of 20 mm or larger) [31,32]. CT or MRI can be considered when patient factors such as obesity, severe parenchymal heterogeneity from advanced cirrhosis, intestinal gas, or chest wall deformity prevent adequate US assessment [24].
There is a considerable false-positive rate with CT or MRI, triggering further cost-ineffective investigations [24]. These modalities also involve the use of contrast, and repeated surveillance would result in accumulated radiation exposure and higher costs; therefore, CT and MRI are not recommended for routine surveillance. Contrast-enhanced ultrasound (CEUS) in the delayed phase can be used to detect HCC. A recent meta-analysis showed a pooled sensitivity and specificity of CEUS for the diagnosis of HCC of 85% and 91%, respectively, values almost comparable with MRI and CT for HCC nodules larger than 2 cm. However, its use in surveillance has not been validated and is therefore not currently recommended. It is important to note that CEUS only allows assessment of one or a limited number of identified nodules, as it cannot image the entire liver during the multiple phases of contrast administration [27,33]. In Fig. 1, the diagnostic workflow of HCC is shown.
Alpha-fetoprotein (AFP) as a tumour marker for HCC has insufficient sensitivity and specificity for tumour detection when used alone. Fluctuating AFP levels in cirrhotic patients may indicate not only HCC development but also exacerbation of underlying liver disease or flares of HBV or HCV infection. Moreover, only a small proportion of early-stage tumours (10-20%) present with elevated AFP. However, when combined with US surveillance, AFP significantly increases sensitivity for diagnosing early-stage HCC: 45% with US alone versus 63% with US and AFP [34,35]. The decision to perform a liver biopsy is made on a case-by-case basis. Generally, biopsy is indicated when the imaging-based diagnosis remains inconclusive but malignancy is considered probable. As per the EASL Clinical Practice Guidelines, in non-cirrhotic patients imaging alone is not considered sufficient, and tissue assessment is required to establish the diagnosis [25]. In Table 1, typical radiological findings of cirrhosis, portal hypertension, and HCC are provided.

Artificial intelligence, machine learning, and deep learning
The term Artificial Intelligence (AI) is an umbrella term referring to a suite of technologies in which computer systems are programmed to exhibit complex behaviour that would typically require intelligence in humans or animals [36]. The overarching goal of AI, inspired by human cognitive function, is to enable machines to perform intellectual tasks such as decision making, problem solving, perception, and understanding human communication.
Machine Learning (ML), a subset of AI, provides systems the ability to automatically learn and improve from experience without being explicitly programmed [37]. The conventional ML pipeline includes the steps of pre-processing, feature extraction, classification, and evaluation. As medical imaging datasets often vary in characteristics such as contrast, resolution, orientation, side-markers, and noise, it is important to apply pre-processing techniques to improve dataset quality. After data cleaning, a relevant region-of-interest (ROI) is selected using fully automatic segmentation, semi-automatic segmentation, or manual delineation by experts. Next, in a feature extraction step, salient features specific to the pattern of a particular medical condition are extracted. Once features are extracted, ML algorithms are applied to map them to the target task, such as classification. Overall, ML allows machines to learn from a set of data and subsequently make predictions on new test data. Applications of ML in medical imaging date back to the early 1980s, when computer-aided detection (CADe) and computer-aided diagnosis (CADx) systems were developed [38]. These CAD(e/x) systems were based on a pre-defined set of explicit parameters, features, or rules derived from expert knowledge. However, a major limitation of classical ML systems is the need for handcrafted feature engineering, which is subjective, requires domain expertise, is time-consuming, and is often brittle.
Deep Learning (DL) [39], a subset of ML, uses multiple layers of neural networks to progressively extract higher-level features from the raw input, overcoming the limitations of hand-crafted feature engineering in classical ML systems. In DL, layers of neural networks are stacked in a hierarchy of increasing complexity and abstraction to obtain a high-level representation of the raw data. DL-based models have demonstrated state-of-the-art performance on a variety of tasks in fields such as computer vision, natural language processing, speech, and medical imaging. The success of deep learning is attributed to the availability of large-scale annotated datasets, enhanced computing power with the rise of graphics processing units (GPUs), and novel algorithms and architectures.
Fig. 1 HCC is the most common type of primary liver cancer. Alpha-fetoprotein (AFP) is one of the most widely used biomarkers for HCC screening, diagnosis, and prognosis of liver diseases
In the following subsections, we provide an overview of traditional CAD systems using machine learning and deep learning for diagnosing liver diseases from US images.

Machine learning based CAD systems
Before the rise of deep learning, classical machine learning based Computer-Aided Detection and Diagnosis (CAD(e/x)) systems involved a pipeline of handcrafted feature extraction and a trainable classifier. ML-enabled CAD(e/x) systems assist radiologists in image interpretation, disease detection, segmentation of regions-of-interest (ROI) such as tumours, and statistical analysis of the extracted tumour. In addition to their diagnostic use, CAD systems are integrated into the clinical workflow to triage and prioritise patients based on urgency, in turn maximising operational performance. A typical CAD system follows a standard pipeline consisting of four steps: pre-processing, segmentation of the ROI, feature extraction, and classification.
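As a toy illustration of this four-step pipeline, the sketch below runs pre-processing, ROI selection, feature extraction, and classification end-to-end on synthetic patches. The clipping filter, central-crop ROI, first-order texture features, and nearest-centroid rule are all illustrative choices, not those of any reviewed study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: pre-processing -- clip intensities to the valid display range [0, 1].
def preprocess(img):
    return np.clip(img.astype(float), 0.0, 1.0)

# Step 2: ROI selection -- here simply a fixed central crop (a stand-in for
# automatic, semi-automatic, or manual segmentation).
def select_roi(img, size=16):
    h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

# Step 3: feature extraction -- first-order texture statistics of the ROI.
def extract_features(roi):
    return np.array([roi.mean(), roi.std(), roi.min(), roi.max()])

# Step 4: classification -- a nearest-centroid rule over the training features.
def nearest_centroid(train_feats, train_labels, test_feat):
    centroids = {c: train_feats[train_labels == c].mean(axis=0)
                 for c in np.unique(train_labels)}
    return min(centroids, key=lambda c: np.linalg.norm(test_feat - centroids[c]))

# Toy data: "normal" patches are darker and smoother than "diseased" ones.
normal = [rng.normal(0.3, 0.05, (32, 32)) for _ in range(5)]
diseased = [rng.normal(0.7, 0.15, (32, 32)) for _ in range(5)]
feats = np.array([extract_features(select_roi(preprocess(p)))
                  for p in normal + diseased])
labels = np.array([0] * 5 + [1] * 5)

test_patch = rng.normal(0.7, 0.15, (32, 32))
test_feat = extract_features(select_roi(preprocess(test_patch)))
pred = nearest_centroid(feats, labels, test_feat)
print(pred)  # -> 1 (the bright, coarse test patch matches the diseased class)
```

Real CAD systems replace each step with the domain-specific methods discussed in the sections that follow, but the data flow between the four stages is the same.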

Deep learning
Deep learning (DL) [39] is a subset of machine learning in which artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data. DL uses multiple layers to represent data abstractions and build computational models. DL has shown high levels of performance on various complex tasks such as speech recognition [46], machine translation [47], object detection [48], caption generation [49], and visual question answering [50]. Some key deep learning algorithms include convolutional neural networks [51], recurrent neural networks [52], and generative adversarial networks [53].

Image pre-processing methods:
• Mean filter [40]: replaces each pixel value in an image with the mean of its neighbouring pixels, including itself.
• Median filter [40]: replaces each pixel value in an image with the median of its neighbouring pixels, including itself.
• Wiener filter [41]: uses statistical properties of the signal to filter out the noise that has corrupted it.
• Bilateral filter [42]: a non-linear, edge-preserving, noise-reducing smoothing filter; it performs spatial averaging without smoothing edges.
• Gaussian filter [43]: a linear smoothing filter whose kernel weights are chosen according to the shape of the Gaussian function.
• Unsharp masking [44]: sharpens an image by computing the difference between the original and its blurred version, increasing the contrast of small details in the magnified texture.
• Histogram equalisation [44]: adjusts image intensities to enhance contrast by stretching out the most frequent intensities, helping low-contrast regions achieve high contrast and improving the global contrast of the image.
• Adaptive histogram equalisation [44]: computes several histograms, each corresponding to a distinct region of the image, and uses them to redistribute the intensity values; suitable for improving local contrast.
• CLAHE [45]: in contrast to histogram equalisation and adaptive histogram equalisation, which are global contrast enhancement methods, Contrast Limited Adaptive Histogram Equalisation performs local contrast enhancement; it has been widely adopted to improve low contrast in ultrasound imaging.

In the following subsection we provide an overview of the building blocks of a typical convolutional neural network, one of the de-facto deep learning algorithms for processing 2D (images) or 3D (volumetric) data.
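As a minimal sketch of one filter from the list above, the median filter can be written directly in NumPy; the 3 × 3 window and reflect-padding of the border are illustrative choices.

```python
import numpy as np

def median_filter(img, k=3):
    """Replace each pixel with the median of its k x k neighbourhood.

    Border pixels are handled by reflecting the image edges, so the
    output has the same shape as the input.
    """
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.empty_like(img, dtype=float)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return out

# A flat patch corrupted by one speckle-like outlier:
img = np.full((5, 5), 10.0)
img[2, 2] = 255.0
smoothed = median_filter(img)
print(smoothed[2, 2])  # -> 10.0 (the outlier is suppressed)
```

Unlike a mean filter, the median discards the outlier entirely rather than smearing it across the neighbourhood, which is why median filtering is a common choice against speckle-like noise.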

Convolutional neural networks
Convolutional Neural Networks (CNNs) are a special type of deep neural network suited to handling two-dimensional data such as images or three-dimensional (volumetric) data.

Feature selection methods:
• ANOVA: a statistical method that computes the differences and their variation among the given classes in the data; based on the resulting p-value and F-value, significant features are selected.
• Mutual Information (MI): quantifies the amount of information obtained about one variable through a second variable; using higher-order statistics calculated with MI, we can select features that maximise the MI between the subset of selected features and the target variable.
• Fisher score: scores each feature independently under the Fisher criterion, providing a subset of the most representative features.
• Locality Sensitive Discriminant Analysis (LSDA): a feature reduction technique based on analysing the relationships between data points; it is effective because it preserves both the discriminant and the local geometrical structure of the data.

• Convolutional layer: The convolutional layer is the core building block of a CNN, using the convolution operation in place of general matrix multiplication. Its parameters consist of a set of learnable filters, also known as kernels. The main task of the convolutional layer is to detect features within local regions of the input image that are common throughout the dataset and map their appearance to a feature map. The output of each convolutional layer is fed to an activation function, such as the Rectified Linear Unit (ReLU) or sigmoid, to introduce non-linearity.
• Sub-sampling (pooling) layer: In CNNs, convolutional layers are typically followed by pooling layers, which reduce the spatial size of the input and thus the number of parameters of the network. A pooling layer takes each feature map output by the convolutional layer and downsamples it; in other words, it summarises a region of neurons in the convolutional layer. The most common pooling techniques are max pooling and average pooling: max pooling takes the largest value from each patch of the feature map, whereas average pooling takes the average of each patch.
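The convolution and pooling operations described above can be sketched in a few lines of NumPy; the 3 × 3 averaging kernel, the 2 × 2 pooling window, and the toy input are illustrative.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """2-D 'valid' convolution (cross-correlation, as in most DL libraries):
    slide the kernel over the input and take the weighted sum at each position."""
    kh, kw = kernel.shape
    h, w = x.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling: keep the largest value in each patch."""
    h, w = x.shape
    return x[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
feat = np.maximum(conv2d_valid(x, np.ones((3, 3)) / 9.0), 0)  # conv + ReLU
pooled = max_pool(x)
print(pooled)  # -> [[ 5.  7.] [13. 15.]]
```

Note how pooling quarters the spatial size: each 2 × 2 block of the 4 × 4 input is summarised by its maximum, which is exactly the parameter reduction the text describes.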

Classification algorithms:
• Naive Bayes (NB): a probabilistic classifier based on Bayes' theorem. It predicts the class of a given sample by computing the maximum posterior probability from the prior probability and the likelihood observed in the training set; the sample is assigned the class with the highest probability.
• K-Nearest Neighbour (KNN): a lazy statistical learning algorithm. The training data act as a feature space; during testing, the test sample is compared to all training samples using a distance metric and assigned the label of the training sample at the least distance. To improve robustness, the contributions of the K nearest neighbours are used to decide the label of the test sample.
• Logistic Regression (LR): a powerful baseline method for supervised classification. Ordinary regression is extended to give the probability of an outcome between 0 and 1; to use logistic regression as a binary classifier, a threshold is set to discriminate a sample between the two classes.
• Decision Tree (DT): a tree-based classifier in which an internal node represents a feature, a branch represents a decision rule, and each leaf node represents an outcome. Decision trees offer easy interpretation and efficient handling of outliers.
• Support Vector Machine (SVM): aims to find the optimal hyperplane with the largest margin between positive and negative samples in a high-dimensional feature space. Kernel functions such as the Gaussian and Radial Basis Function are used for non-linear mapping of the training data from the input space to a higher-dimensional feature space. The SVM is suitable for complex datasets and shows good generalisation on unseen test sets.
• Random Forest (RF): an ensemble learning method in which the predictions of multiple classifiers are voted to form the final prediction. Ensemble learning methods are generally robust and provide superior performance given the pros and cons of single classifiers.
• Extreme Learning Machine (ELM): a single-layer feed-forward neural network that can be trained in a single pass, making it faster than conventional machine learning algorithms. The ELM has three layers (input, hidden, and output); the input-to-hidden weights are randomly initialised and fixed, while the hidden-to-output weights are learnt by the classifier in a single pass.

• Dropout: Dropout is a regularisation technique heavily used in convolutional neural networks. Some units or connections are randomly dropped (skipped) with a certain probability. Because of its many connections, a neural network co-adapts by learning non-linear relations; dropout counteracts this co-adaptation by randomly dropping some of the connections or units, preventing the network from overfitting the training data.
• Fully connected layer: In fully connected layers, each neuron from the previous layer is connected to every neuron in the next layer, and every value contributes to predicting the class of the test sample. The output of the last fully connected layer is passed through an activation function, generally softmax, which outputs the class scores. Fully connected layers are mostly used at the end of a CNN for the classification task.
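As a sketch of one algorithm from the table above, a bare-bones KNN classifier takes only a few lines; the toy 2-D features, the Euclidean metric, and k = 3 are illustrative choices.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, test_x, k=3):
    """Label a test sample by majority vote among its k nearest
    training samples, using Euclidean distance."""
    dists = np.linalg.norm(train_X - test_x, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    return Counter(nearest.tolist()).most_common(1)[0][0]

# Toy 2-D features: class 0 clustered near the origin, class 1 near (5, 5).
X = np.array([[0.0, 0.2], [0.3, 0.1], [0.1, 0.4],
              [5.1, 4.9], [4.8, 5.2], [5.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([4.5, 5.5])))  # -> 1
```

The "lazy" nature of KNN is visible here: no model is fitted at all, and every prediction scans the full training set, which is why its test-time cost grows with the number of training samples.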

Representative convolutional neural networks
The history of convolutional neural networks dates back to LeNet-5 [54], which was proposed for digit recognition. Due to the lack of computational resources at that time, the use of LeNet was limited. The LeNet-5 model uses tanh as a non-linear activation function, followed by a pooling layer and three fully connected layers. It was the AlexNet [51] model that made a major breakthrough, drastically reducing the top-5 error rate on the ImageNet challenge compared with previous shallow networks. Since AlexNet, a series of CNN models have steadily advanced the state of the art on ImageNet. In AlexNet, tanh is replaced by rectified linear units (ReLU), and the dropout technique is used to selectively ignore units to avoid overfitting of the model. To boost predictive performance, the Visual Geometry Group at Oxford developed VGG-16 [55]. VGG increased the memory and computational requirements because of the increased depth of the network, with 16 layers combining convolution and pooling. To limit memory requirements, various structural or topological decompositions were applied, leading to more powerful models such as GoogleNet [56] and Residual Networks (ResNet) [57]. GoogleNet uses an Inception module that computes 1 × 1, 3 × 3, and 5 × 5 filters in parallel, but applies bottleneck 1 × 1 filters to reduce the number of parameters. Further changes were made to the original Inception module by replacing the 5 × 5 filters with two successive layers of 3 × 3 filters, called Inception v2. Szegedy et al. [56] later released the Inception v3 model, in which depth, width, and the number of features are increased systematically by increasing the feature maps before each pooling layer. ResNet was the first network with more than 100 layers, using an idea similar to Inception v3: the output of two successive convolutional layers is combined with the input that bypasses the two layers, acting much like a Network-in-Network. ResNet further increases predictive performance by leveraging rich combinations of features while keeping computation low. In Table 6, a summary of some of the most common representative CNN models is provided.
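The ResNet skip connection described above can be illustrated on a plain feature vector; here two matrix multiplications stand in for the two convolutional layers (an illustrative simplification). With zero weights the block reduces to ReLU of the input, which is precisely the near-identity behaviour that makes very deep residual networks easy to optimise.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, w1, w2):
    """Simplified residual block on a feature vector: two learned
    transformations, with the input added back (the skip connection)
    before the final non-linearity."""
    return relu(x + w2 @ relu(w1 @ x))

x = np.array([1.0, -2.0, 3.0])
zero_w = np.zeros((3, 3))  # with zero weights the block passes x through
out = residual_block(x, zero_w, zero_w)
print(out)  # -> [1. 0. 3.], i.e. relu(x)
```

Because the layers only need to learn a residual on top of the identity mapping, gradients can flow through the skip path even when the stacked layers are poorly initialised, which is what allowed networks beyond 100 layers to train.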

CNN training strategies
In this section, we highlight various strategies that are helpful in training deep convolutional neural networks and improving their performance.

Evaluation measures
For evaluating any model, precision, recall, F1-measure, and accuracy scores are computed from the confusion matrix.
• Precision is the fraction of positive detections of cirrhosis that are correct.
• Recall measures how many of the actual positives are found, that is, the percentage of relevant cases correctly classified by the model. It is also called sensitivity.
• F1-measure is the harmonic mean of precision and recall.
For a binary classification task, the confusion matrix is a 2 × 2 table reporting four primary counts: True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN); a false negative, for example, is a cirrhotic patient detected as a healthy one.
The Receiver Operating Characteristic (ROC) curve is a 2D plot of the True Positive Rate (sensitivity) against the False Positive Rate (1 − specificity), representing the trade-off between sensitivity and specificity. The Area Under the ROC Curve (AUC) measures how well a model can discriminate between patients with liver disease and a healthy group of individuals.
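These definitions translate directly into code; the small helper below computes all four scores from the confusion-matrix counts (the example counts are illustrative, not taken from any reviewed study).

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall (sensitivity), F1-measure, and accuracy
    from the four counts of a 2 x 2 confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Example: 80 cirrhotic cases detected, 20 missed, 10 healthy flagged, 90 cleared.
p, r, f1, acc = binary_metrics(tp=80, fp=10, fn=20, tn=90)
print(round(p, 3), round(r, 3), round(f1, 3), round(acc, 3))
# -> 0.889 0.8 0.842 0.85
```

Note that accuracy alone can be misleading when classes are imbalanced (a model predicting "healthy" for everyone scores high accuracy on a mostly healthy cohort), which is why precision, recall, and AUC are reported alongside it.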

Contributions
This article aims to characterise the diagnosis, staging, and surveillance of liver diseases using medical imaging, machine learning, and deep learning techniques through a methodical review of the literature. We seek to answer the following research questions: To answer these questions and draw our insights, we methodically studied 77 articles from a variety of publication venues, mostly published between January 2010 and December 2021. There have been previous surveys related to liver diseases [62-67]. The survey by [62] mostly focused on diffuse liver diseases and covered only conventional CAD systems. [63] focused on radiographic features under different medical imaging modalities for diagnosing liver diseases. Similar to [62], [64,65] focused on conventional ML pipelines for diagnosing liver lesions using US imaging. Although [66] provided details about both machine learning and deep learning models for diagnosing liver diseases using US imaging, they did not follow a systematic approach. In Table 7, the current study is compared to the existing surveys. The current study provides a more detailed and systematic account of the state-of-the-art ML and DL approaches for diagnosing liver diseases. We also provide details about datasets, methods of severity scoring, and professional society guidelines. We close the review by discussing the limitations of existing studies and noting future research directions to further improve diagnostic performance, expedite clinical workflow, augment clinicians in their decision making, and reduce healthcare costs. The review is structured as follows:
• Sect. 7 describes the search strategy: selected databases, inclusion and exclusion criteria, and the keywords in the search query.
• Sect. 8 provides a systematic review of diagnosing liver diseases using ultrasound imaging.
• Sect. 9 provides an overview of various public datasets for the diagnosis of liver diseases.
• Sect. 10 presents current limitations and future research directions.

Search strategy
We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [68] to perform our review.

Data sources and search queries
We conducted a comprehensive search to identify all potentially relevant publications on the application of AI, including machine learning and deep learning, to the diagnosis of liver diseases using medical imaging. The Web of Science, Scopus, IEEE Xplore, and the ACM digital library were queried for articles indexed from January 2010 to 31 December 2021. We included articles written in English and excluded editorials, errata, letters, notes, and comments.
In Table 8, our inclusion and exclusion criteria are provided. We first identified keywords and their associations to form our search query. For ease of search, we divided our keywords into four main concepts. The first concept covers keywords related to liver diseases such as chronic liver disease(s), acute liver disease(s), liver lesion(s), nonalcoholic fatty liver disease, and hepatocellular carcinoma; the first row in Table 9 shows all keywords relevant to this concept. The second concept relates to various tasks such as classification, detection, segmentation, and staging of liver diseases; the second row in Table 9 shows all keywords for these tasks. The third concept relates to the imaging modalities by which liver diseases are diagnosed, including ultrasound, contrast-enhanced ultrasound, computed tomography, and magnetic resonance imaging; the third row in Table 9 shows these keywords and their abbreviations. The fourth concept covers keywords related to computer applications such as computer-aided diagnosis, machine learning, and deep learning. For each concept we included associated keywords as well as their abbreviations to make the search criteria complete. The final query is the logical AND of the four concepts. A complete picture of the concepts, the keywords related to each concept, and the final search query for the databases is presented in Table 9.

Article selection
The search query retrieved 1,878 studies in total (Web of Science: 455, Scopus: 543, IEEE Xplore: 786, and ACM digital library: 94). We created an EndNote library for our screening process. We first used the "Find Duplicates" function in EndNote to find potential duplicate studies; the software flagged 356 duplicates, which we removed, leaving a total of 1,700 studies. Given the limited functionality of EndNote's built-in duplicate detection, we then removed further duplicates manually.
After the manual duplicate removal step, we were left with a total of 1,494 studies. Our first screening step was based on article title and abstract; many studies targeted other human organs while using one of the imaging modalities of interest. After this first screening, 608 relevant studies remained. In the second screening step, we read the full texts and excluded studies on animals (3), studies on other topics (30), review papers (42), and studies focusing on a specific medical condition (9), leaving a total of 524 studies. In the third screening step, we separated studies by imaging modality: of the 524 studies, 243 used CT, 81 used MRI, and 147 used US, while 53 used liver biopsy, blood biomarkers, urine biomarkers, or diffraction-enhanced imaging for diagnosing liver diseases. In the final stage of our screening, we reviewed the 147 liver US studies and assessed their quality based on the questions in our quality assessment given in Table 10, which include: (3) Is the train and test data source clearly defined? (4) Are the data pre-processing techniques clearly defined and their selection justified? (5) Are the feature extraction or feature engineering techniques clearly described and justified? (6) Are the learning algorithms clearly described? (7) Does the study perform a comparison with existing baseline models? (8) Is the performance of the proposed system evaluated, and are the results properly interpreted and discussed? (9) Does the conclusion reflect the research findings? After the final quality assessment, we were left with a total of 62 studies.
During our detailed review, we also went through the references of these 62 studies and found a few relevant studies which our search query had missed, giving a total of 77 studies. We reviewed the 77 studies in detail and provide a detailed discussion of their applications, methodology, and results. In Fig. 2, the flowchart of our complete article selection process following the PRISMA guidelines is shown. Finally, we conducted a methodical review and qualitative analysis of the 77 studies in accordance with our inclusion and exclusion criteria. Article selection was performed by the first author (S.S.) and agreed by all other authors of this article. Due to the heterogeneity and multidisciplinary nature of the included studies, a formal meta-analysis was not possible. We did, however, visually summarise overall performance using different performance metrics, including sensitivity, specificity, F1-score, and receiver operating characteristic (ROC) curves, which are presented later.

Review of studies on the diagnosis and staging of liver diseases
In this section, we review selected studies based on the disease of interest. In Table 11, a summary of studies on fibrosis classification or staging is provided. Most of these studies extracted textural features and applied conventional machine learning algorithms for classification. A few studies [69,70] fused multiple ultrasound modalities to improve diagnostic performance on fibrosis staging. In Table 12, a summary of studies on cirrhosis classification is provided; the focus is on separating normal cases from cirrhosis. In terms of methods, studies have applied both conventional machine learning and deep learning. Table 13 summarises studies focusing on nonalcoholic fatty liver disease diagnosis; most applied a combination of texture feature extraction algorithms and conventional ML classifiers. Table 14 summarises studies on multi-class liver disease classification. One study [71] performed six-class classification with the labels normal, steatosis, chronic hepatitis without cirrhosis, compensated cirrhosis, decompensated cirrhosis, and HCC. In Fig. 3, the year-wise number of studies is shown; the plot shows an increasing trend over the years. In Fig. 4, we show the number of studies applying ML or DL methods for diagnosing liver diseases. The plot shows that machine learning was the de facto choice before 2017; with the rise of deep learning, however, there has been a sharp increase in the number of studies using deep learning methods. In Fig. 5, the distribution of studies is given based on various applications. Of the 77 studies covered in this review, the distribution is: fibrosis classification (n=10), cirrhosis classification (n=7), NAFLD (n=22), CLD (n=8), FLL (n=24), and HCC diagnosis and prognosis (n=6). The results of the studies in Table 14 show that as the number of classes increases, model performance degrades. This is due to the correlation between diseases and their overlapping biomarkers. In Table 15, a summary of studies focusing on different focal liver lesions and hepatocellular carcinoma is presented. A major observation is that most of these studies used small private (in-house) datasets containing only a few samples of liver lesions such as hemangioma (HEM) or metastases (MET). The studies report quite high classification performance in terms of accuracy scores; however, accuracy is not a robust metric, especially when working with imbalanced datasets. Finally, Table 16 provides a summary of studies focusing on HCC prognosis. Most of these studies focused on patient survival analysis. A few recent studies showed that CNNs outperform classical ML algorithms for HCC diagnosis and prognosis.
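The pitfall of reporting only accuracy on imbalanced data is easy to reproduce. In the hypothetical split below (the 90/10 counts are purely illustrative, not from any reviewed dataset), a degenerate classifier that always predicts the majority "benign" class still reaches 90% accuracy while detecting no lesions at all:

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, cls=1):
    """Fraction of true instances of `cls` that were detected."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    return tp / sum(t == cls for t in y_true)

# Hypothetical imbalanced test set: 90 benign (0), 10 malignant (1)
y_true = [0] * 90 + [1] * 10
y_majority = [0] * 100          # classifier that always predicts "benign"

print(accuracy(y_true, y_majority))  # 0.9 -- looks strong
print(recall(y_true, y_majority))    # 0.0 -- misses every malignant case
```

This is why class-sensitive metrics such as per-class recall, F1-score, or AUC are preferable to plain accuracy when datasets are imbalanced.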

Public datasets and online initiatives for the diagnosis of liver diseases
Most of the studies reviewed in this article use private (in-house) datasets. Recently, the research community has recognised the need to release benchmark datasets in the public domain so that computational methods can be fairly compared. The ImageNet dataset [60] has been one of the underlying factors in the success of deep learning in computer vision, because it enabled targeted progress and objective comparison of methods proposed by the community around the world. With the same motivation, researchers in biomedical image computing have started sharing their curated datasets publicly to advance research and foster fair comparison of methods. We provide an overview of various liver datasets below:
• B-mode fatty liver ultrasound [103]

Limitations and future directions
In this section, we outline limitations of existing studies based on our extensive literature review.We also propose future research directions to overcome these limitations.

Limitations
• Focus on classification: Most studies focused on classification tasks, i.e., binary classification (e.g., normal vs. fatty, normal vs. fibrosis, normal vs. cirrhosis) or multi-class classification (e.g., normal vs. fibrosis vs. cirrhosis, or normal vs. hepatocellular carcinoma vs. metastasis vs. hemangioma). However, very little work has been done on disease progression and severity scoring.
• Small in-house datasets: Although there has been a lot of work on diagnosing liver diseases using imaging modalities such as ultrasound, CT, and MRI, there are only a few publicly available datasets. Most studies in our literature review worked on in-house data of small size. These datasets also suffer from class imbalance. As the accuracy score is not a reliable metric on imbalanced datasets, the lack of publicly available benchmark datasets often prevents a true assessment of the algorithms proposed in these studies.
• Classical CAD systems still prevalent: Deep learning has shown tremendous performance improvements across fields such as computer vision, natural language processing, robotics, and biomedical image processing. Specifically in biomedical image computing, deep learning has shown superior results on tasks such as classification, segmentation, and tracking. However, due to the lack of publicly available large-scale annotated liver US datasets, the classical machine learning pipeline is still prominent in the community.

Future research direction
• Need for a multidisciplinary approach: The management of HCC encompasses multiple disciplines, including hepatologists, diagnostic radiologists, pathologists, transplant surgeons, surgical oncologists, interventional radiologists, nurses, and palliative care professionals [166]. A study by [167] showed that the development of a true multidisciplinary clinic with a dedicated tumour board review for HCC patients increased survival, due to improved staging and diagnostic accuracy, efficient treatment times, and increased adherence to clinical diagnostic and therapeutic guidelines. Therefore, the AASLD recommends referring HCC patients to a centre with a multidisciplinary clinic.
• Make use of multi-modal data: Current state-of-the-art deep learning models, when trained on multi-modal data such as B-mode images, Doppler images, contrast-enhanced ultrasound images, and SWE images, could improve the early staging and diagnosis of HCC. Multi-modal data provide complementary information, in turn helping models to improve.
• Need for benchmark datasets: To push the community's effort in improving diagnostic performance through novel methods, there is a need to establish a benchmark environment with the release of a large-scale annotated dataset in the public domain. As with established benchmark challenges, the task organisers can release annotated training and validation data while withholding test labels; once participants have fine-tuned their methods, they submit predictions on the test set to the challenge evaluation server.

Conclusion
HCC-related morbidity and mortality continue to rise due to delays in diagnosis and treatment, as early disease is often asymptomatic. Ultrasound is the recommended first-line imaging modality for diagnosing chronic liver disease and screening for HCC; however, contrast-enhanced studies are required to confirm an HCC diagnosis. In this paper, we first provided an overview of current diagnostic methods for the stages of liver disease. We then laid the foundation of methods such as image preprocessing, feature extraction, and classification for classical machine learning algorithms, together with a brief overview of convolutional neural networks, which are specialised deep learning algorithms for processing 2D or 3D data. We then reviewed the use of these methods as diagnostic tools for chronic liver disease and HCC, and discussed the reviewed studies. Finally, we provided future research directions for improving diagnostic accuracy and efficiency in the clinical workflow. We believe that by adopting AI technologies in medical radiology, diagnostic imaging tools have the potential to be implemented in the first-line management of chronic liver disease and HCC.

CNNs are specialised deep learning algorithms for processing 2D or 3D data such as videos. They have been successfully applied in medical imaging problems such as skin cancer classification, arrhythmia detection, fundus image segmentation, thoracic disease detection, and lung segmentation. CNNs consist of multiple layers stacked together, which use local connections known as local receptive fields and weight sharing for better performance and efficiency. A typical CNN architecture consists of the following layers:
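The local-receptive-field and weight-sharing mechanism can be illustrated with a minimal "valid" 2D convolution (technically cross-correlation, as in most CNN libraries). This NumPy sketch with a toy edge-detector kernel is illustrative only and not tied to any reviewed study:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image: every output pixel is
    produced by the SAME weights applied to a local neighbourhood
    (weight sharing over local receptive fields)."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge detector applied to an image that is dark on the
# left half and bright on the right half
image = np.array([[0, 0, 1, 1]] * 4, dtype=float)
kernel = np.array([[-1.0, 1.0]])   # responds where intensity jumps
response = conv2d_valid(image, kernel)
print(response)  # each row is [0. 1. 0.]: response exactly at the edge
```

Because the same kernel is reused at every position, the number of learned parameters is independent of the image size, which is the efficiency argument made above.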

Fig. 2
Fig. 2 PRISMA flowchart for including articles in our study. The flow diagram depicts the flow of information through the different phases of the methodical review. It maps out the number of studies identified, included and excluded, and the reasons for exclusions.

Fig. 5 Distribution of the 77 studies by application: fibrosis classification (n=10), cirrhosis classification (n=7), NAFLD (n=22), CLD (n=8), FLL (n=24), and HCC diagnosis and prognosis (n=6). Most of these studies address a classification problem considering one disease versus the rest.

Fig. 3 Number of studies per year. In total there were 77 selected studies that met our selection criteria.

Fig. 4 Number of studies applying ML or DL methods for diagnosing liver diseases.

• B-mode fatty liver ultrasound [103]: This dataset, released for NAFLD steatosis assessment using ultrasound images, contains 550 B-mode ultrasound scans and the corresponding liver biopsy results. It was collected from 55 subjects admitted for bariatric surgery in the Department of Internal Medicine, Hypertension and Vascular Diseases, Medical University of Warsaw, Poland.
• SYSU-CEUS [157]: The SYSU-CEUS dataset contains 353 CEUS videos of three types of focal liver lesions: 186 instances of hepatocellular carcinoma (HCC), 109 instances of hemangioma (HEM), and 58 instances of focal nodular hyperplasia (FNH).

Datasets specific to liver tumours have also been made available through online challenges:

• LiTS [158]: The Liver Tumor Segmentation Challenge (LiTS) dataset provides 201 contrast-enhanced 3D abdominal CT scans with segmentation labels for liver and tumour regions. Each slice of a volume has a resolution of 512 × 512 pixels. Out of the 201 volumes, 131 carry annotations, whereas no ground-truth labels are provided for the test set of 70 volumes. The in-plane resolution ranges from 0.60 mm to 0.98 mm, and the slice spacing from 0.45 mm to 5.0 mm.
• SLIVER07 [159]: The Segmentation of the Liver 2007 (SLIVER07) dataset is part of the grand challenge organised in conjunction with MICCAI 2007 for liver tumour segmentation. The training data consist of 10 tumours from 4 patients with their ground-truth segmentations. For the test set, consisting of 10 tumours from 6 patients, the ground truth was not made public by the task organisers. The dataset contains liver tumour CT images corresponding to the portal phase of a standard four-phase contrast-enhanced imaging protocol.

Table 1
Typical radiological findings of liver disease

Table 2
Brief overview of various pre-processing methods to remove noise and enhance the image quality

Table 3
Brief overview of various feature extraction methods:
• First-order statistics: Average gray level (mean), standard deviation, variance, skewness, kurtosis, uniformity, energy, entropy
• Statistical feature matrix (SFM): Coarseness, contrast, periodicity, and roughness
• Law's texture energy measures: Measures based on five coefficient vectors representing level (L), edge (E), spot (S), ripple (R), and wave (W); in total 18 texture features can be extracted
• Fourier power spectrum (FPS): Radial sum and angular sum of the discrete Fourier transform
• Gray-level co-occurrence matrix (GLCM): Dissimilarity, contrast, correlation, homogeneity, autocorrelation, cluster shade, cluster prominence, maximum probability, sum of squares, sum average, sum variance, sum entropy, difference variance, difference entropy, information measure of correlation, inverse difference moment (normalised)
• Moment invariants (MI): A set of moments invariant to rotation, scaling, and translation, derived from second and third normalised central moments
• Gradient-based features: Mean, variance, kurtosis, skewness, and percentage of pixels with non-zero gradient
• Gray-level run-length matrix (GLRLM): Short run emphasis, long run emphasis, gray-level non-uniformity, run-length non-uniformity, run percentage, low gray-level run emphasis, high gray-level run emphasis, short run high gray-level emphasis, long run low gray-level emphasis, long run high gray-level emphasis
• Gabor wavelet transform (GWT): Mean and standard deviation of Gabor output images obtained using a set of Gabor wavelets at different scales and orientations
• Geometric: Centre of gravity (x and y), height, width, area, perimeter, roundness, Euler number, major axis length, minor axis length, orientation, solidity, extent, eccentricity, convex area, Danielsson factor, filled area
• Frequency-domain: Discrete Cosine Transform (DCT), Discrete Wavelet Transform (DWT), Wavelet Packet Transform (WPT), Curvelet Transform (CT), and Stationary Wavelet Transform (SWT) features
• Phase congruency: Variance, contrast, covariance
• Gabor texture: Multiple Gabor filters with different frequencies and orientations can be used to extract specific features from an image

Table 4
Brief overview of various feature selection algorithms:
• Principal Component Analysis (PCA): A statistical technique that converts high-dimensional data to low-dimensional data by selecting the components that capture the most information about the dataset. The most relevant features are selected based on the variance they explain in the original data
• Pearson's correlation coefficient: Measures the correlation between features to find out which are highly correlated and which are not; based on this analysis, redundant features that do not add value to the final prediction are dropped
• Analysis of Variance (ANOVA)
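Several of the first-order statistical features listed in Table 3 can be computed directly from the pixel intensities of an image patch. A NumPy sketch on a toy patch (the values are illustrative, not taken from any dataset in this review):

```python
import numpy as np

def first_order_features(patch):
    """First-order statistical features of a gray-level patch."""
    x = patch.astype(float).ravel()
    mean = x.mean()
    std = x.std()
    var = x.var()
    skew = ((x - mean) ** 3).mean() / std ** 3 if std else 0.0
    kurt = ((x - mean) ** 4).mean() / std ** 4 if std else 0.0
    # Energy and entropy computed from the normalised gray-level histogram
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    energy = np.sum(p ** 2)
    entropy = -np.sum(p * np.log2(p))
    return {"mean": mean, "std": std, "variance": var,
            "skewness": skew, "kurtosis": kurt,
            "energy": energy, "entropy": entropy}

# Toy two-level patch: half the pixels are 0, half are 1
patch = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1]])
feats = first_order_features(patch)
print(feats["mean"], feats["entropy"])  # 0.5 and 1.0 bit for a 50/50 patch
```

In the reviewed studies, vectors of such features computed from liver US regions of interest are typically fed to the conventional ML classifiers summarised in Table 5.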

Table 5
Brief overview of various machine learning algorithms

• Transfer learning [59]: Transfer learning refers to the ability to share and transfer knowledge from a source task to a target task. Convolutional neural networks learn features in a hierarchical manner, whereby early layers learn generic image features such as edges and corners, whereas later layers learn features specific to the dataset. Given that it is challenging to obtain large-scale annotated datasets in the medical domain due to cost and time constraints, transfer learning helps to leverage models trained on large-scale datasets such as ImageNet [60].
• Data augmentation: Current state-of-the-art CNNs need large-scale annotated data to train in a supervised manner. Given the complexity of CNN models, it is easy for them to overfit on small medical imaging datasets. Data augmentation [61] is a technique to generate synthetic data, for example by applying different affine transformations such as rotation, scaling, translation, and flipping, or by adding noise. Data augmentation not only increases the dataset size during training, but also adds diversity to the data, making the model more robust on unseen data.
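The flip, rotation, and noise augmentations described above can be sketched in a few lines of NumPy; the specific transformations and noise level here are illustrative choices, not those of any reviewed study:

```python
import numpy as np

def augment(image, rng):
    """Yield simple augmented variants of a 2D image or patch."""
    yield np.fliplr(image)                          # horizontal flip
    yield np.flipud(image)                          # vertical flip
    yield np.rot90(image)                           # 90-degree rotation
    yield image + rng.normal(0, 0.01, image.shape)  # additive Gaussian noise

rng = np.random.default_rng(0)
image = np.arange(16, dtype=float).reshape(4, 4)
variants = list(augment(image, rng))
print(len(variants))  # 4 synthetic variants generated from one scan
```

In practice such transformations are applied on the fly during training, so every epoch sees a slightly different version of each scan.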

Table 6
Representative CNN architectures and their high-level description:
• AlexNet: The first CNN model to win the ImageNet challenge (2012), which sparked the deep learning revolution. Compared to LeNet, AlexNet uses the ReLU activation function, dropout for regularisation, data augmentation during training, and splits computation across multiple GPUs.
• VGG [55]: A popular deep CNN model from the University of Oxford. The VGG network popularised the use of small filter kernels and training deeper networks by pre-training shallower versions. Two popular variants are VGG-16 (16 layers) and VGG-19 (19 layers).
• GoogLeNet [56]: Winner of the 2014 ImageNet challenge. The model contains multiple inception modules, which enable multiscale processing by extracting features at different levels of detail simultaneously. Stacking many such layers makes the model quite deep, yet it has relatively few parameters. A popular GoogLeNet variant is Inception-v3.
• ResNet [57]: Winner of the 2015 ImageNet challenge. ResNet networks contain skip connections that preserve information by copying activations from lower layers to higher layers. Concatenating and stacking multiple ResNet blocks makes it possible to build much deeper networks with fewer parameters. Skip connections alongside the standard pathway give the network the ability to preserve information, learn residuals, and support much deeper architectures. Major variants include ResNet-18, ResNet-50, ResNet-101, and ResNet-152.
• DenseNet [58]: The DenseNet model concatenates the activations of all previous layers with those of the current layer. Reusing the feature maps of all previous layers encourages feature reuse and reduces the number of training parameters. Concatenating activations from previous layers preserves global state, making DenseNets particularly well-suited for smaller datasets, especially medical imaging datasets. An important DenseNet variant applied by the medical imaging community is DenseNet-121.
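The skip-connection idea behind ResNet can be written in one line: the block learns a residual f(x) that is added to an identity path, so information from lower layers is preserved even when f contributes nothing. A NumPy sketch, where the two weight matrices stand in for the block's real convolutional layers:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = x + f(x): the identity path copies the input forward,
    while f(x) = w2 @ relu(w1 @ x) learns only the residual."""
    return x + w2 @ relu(w1 @ x)

x = np.array([1.0, 2.0, 3.0])
# If the residual branch is zeroed out, the block is exactly the identity,
# which is why very deep stacks of such blocks remain trainable.
zeros = np.zeros((3, 3))
print(residual_block(x, zeros, zeros))  # [1. 2. 3.] -- input passes through
```

This identity-by-default behaviour is what lets gradients flow through very deep stacks of blocks, in contrast to plain stacked layers that must relearn the identity mapping.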

Table 7
Comparison of review articles related to our survey paper, with their methods and scope

Table 8
List of inclusion and exclusion criteria. The exclusion criteria include: (2) studies diagnosing liver diseases using other imaging modalities such as Computed Tomography (CT), Magnetic Resonance Imaging (MRI), serum biomarkers, liver biopsy, or Magnetic Resonance Imaging derived Proton Density Fat Fraction (MRI-PDFF); (3) studies without a technical contribution, such as white papers, case studies, letters, or abstract-only publications.

Table 9
Search query related to four main concepts which are combined to formulate final query for four databases

Table 12
Summary of studies on Cirrhosis classification using liver US images

Table 16
Summary of studies on HCC diagnosis and prognosis using liver US images