Introduction

Liver cancer is one of the most lethal cancers in the world. It was the third leading cause of cancer mortality (approximately 830,000 deaths) in 2020 [1]. Computed Tomography (CT) is the most frequently used imaging technique for identifying hepatic cancer. Various Computer Aided Diagnosis (CADx) solutions have been investigated to aid radiologists in decision-making and increase diagnosis efficiency. Liver segmentation is the first and most critical stage of a CADx system and is therefore decisive in determining the success of a diagnosis. However, liver delineation is difficult due to: (i) ambiguous boundaries with adjacent structures, (ii) large shape variability, (iii) the presence of organs with similar intensity in the vicinity, (iv) intensity variations and noise in the liver due to image acquisition and injection protocols [2] and (v) the division of the liver into right and left lobes.

Liver segmentation has been the subject of extensive research for over two decades. Earlier studies explored traditional segmentation methods, primarily level set, Fuzzy C-Means (FCM) and region growing. Xu et al. [3] presented a semiautomatic approach in which region growing was utilized for initial liver delineation and the level set method for final refinement. Wang et al. [4] suggested a shape–intensity prior level set method using probabilistic atlas and probability map constraints. Eapen et al. [5] delineated the liver using a Bayesian probabilistic level set framework. Various swarm optimization techniques were explored in [6,7,8]. Most of these methods required user intervention and were not very robust.

In recent years, the application of Deep Learning (DL) approaches for segmentation has risen rapidly. Liu et al. [9] delineated the liver using UNet and dense feature selection. Jeong et al. [10] incorporated a long short-term memory network and an attention mechanism into UNet. Sun et al. [11] proposed a UNet-based architecture that addressed the pitfalls of skip connections and incorporated a self-attention mechanism for liver segmentation. In [12], a 3D version of UNet was developed by incorporating residual connections. Chung et al. [13] presented a Convolutional Neural Network (CNN) combining auto-context and self-supervised sparse contour attention mechanisms. Ahmad et al. [14] employed a deep belief network for initial liver delineation and the Chan-Vese active contour method for final refinement. Senthilvelan et al. [15] developed a cascaded CNN model consisting of V-Net for initial liver segmentation and H-DenseUNet for final refinement. Araújo et al. [16] cascaded multiple UNets for segmenting simple and complex cases; however, the computational cost was high. Fan et al. [17] presented a variant of UNet in which the skip connections were modified to extract better features; they also introduced special modules to fuse high- and low-level features and to capture multiscale details. Xie et al. [18] combined dynamic adaptive pooling, residual modules and UNet to segment the liver from CT data. Ahmad et al. [19] developed an efficient CNN, initialized randomly with Gaussian weights, for liver segmentation. Wei et al. [20] integrated a generative adversarial network into a mask region-based CNN to enhance liver segmentation results. Wang et al. [21] combined EfficientNetB4, attention gates and residual learning for liver delineation. Wu et al. [22] presented a UNet-based DL model that included pyramidal convolution and attention mechanisms. These works were automatic but focused only on extracting the liver from the Portal Venous (PV) phase.

In clinical practice, plain and contrast-enhanced CT images, consisting of arterial, PV and delayed phase images, are generally analyzed for tumor identification. Radiologists diagnose tumors by observing the enhancement patterns (generated by the contrast agent) in and around them. The majority of research on liver segmentation is centered on segmenting the liver solely from the PV phase; very few authors have worked on multiple CT phases. For instance, Xu et al. [23] employed a network derived from UNet to segment the liver from triphasic CT data. The approach of Rusko et al. [24] was based on a region-growing algorithm; they incorporated various pre- and post-processing operations using anatomical and multiphase information to reduce over- and under-segmentation of the liver. These studies required image registration, which is very time-consuming.

A liver segmentation method feasible for a CADx system must be automatic, accurate, robust and computationally efficient; being effective for multiple CT phases would further add value. This paper aims to deliver such a method. We have developed a DL model from SegNet using two key components: an Atrous Spatial Pyramid Pooling (ASPP) module and leaky Rectified Linear Unit (ReLU) layers. The ASPP module captured multiscale features without reducing the feature map resolution, and the leaky ReLU layers improved the model's generalizability. Performance evaluation on a public dataset with challenging cases and on our institutional dataset (consisting of multiphase CT volumes) yielded satisfactory results. Ablation studies justified the significance of the model components, and comparison with state-of-the-art techniques indicated that our model was superior.

The rest of the paper is structured as follows: Section "Materials and Methods" elaborates on the datasets employed and the proposed method; Section "Experimental Results" presents the quantitative and qualitative results; Section "Discussion" discusses the results obtained; finally, Section "Conclusion" outlines the contributions and presents future work.

Materials and Methods

Dataset Details

The training/validation and test datasets were prepared primarily from different databases. Hence, we present the details of the datasets in two separate subsections.

Training and Validation We collected 4994 diverse CT images from three databases (two public and one internal), namely 3D-IRCADb [25], LiTS [26] and Kasturba Medical College (KMC), Manipal. They comprised livers of different shapes and intensity distributions, with and without tumors, as well as abdominal images of the lung, heart, intestine, etc. that did not contain the liver. This data was split into training and validation sets such that over 78% of the total CT images were used for training and the remainder for validation; thus, the training and validation datasets comprised 3930 and 1064 CT images, respectively. The split was done manually so that difficult cases were distributed equally between the two sets. We ensured that both datasets had similar diversity and that neither was dominated by any single type of image. This was important because only the training images are used by the model to learn features; if the training set lacked images of any of the types mentioned above, the model would be unable to learn the features required to identify the liver in challenging images. The validation accuracies for two split ratios, 78:22 and 90:10, were compared; since the former gave slightly better results, it was adopted in this work.

The number of cases and CT images considered from each database, along with other relevant attributes, are detailed in Table 1. The images in the public datasets had a fixed resolution of 512 × 512, whereas the images in the KMC, Manipal dataset had differing resolutions. The 3D-IRCADb and KMC, Manipal datasets are in Digital Imaging and Communications in Medicine (DICOM) format, whereas the LiTS dataset is in Neuroimaging Informatics Technology Initiative (NIfTI) format. Only PV images were considered from the two public databases, whereas multiphase images were taken from the KMC, Manipal database. Since the training images were selected from different databases, reasonable diversity was introduced into the training data.
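As an illustration, both source formats can be read in MATLAB (the implementation environment, see "Experimental Results"); the file paths below are hypothetical placeholders, not the actual dataset layout.

```matlab
% Minimal sketch of loading one slice from each source format; the
% file paths are hypothetical placeholders.
dcmFile = 'IRCADb/patient01/image_45.dcm';   % DICOM slice (3D-IRCADb, KMC)
niiFile = 'LiTS/volume-0.nii';               % NIfTI volume (LiTS)

dcmSlice = dicomread(dcmFile);   % 512 x 512 stored pixel values
dcmInfo  = dicominfo(dcmFile);   % header (rescale slope/intercept, etc.)

niiVol   = niftiread(niiFile);   % full H x W x D volume
niiSlice = niiVol(:, :, 45);     % extract one axial slice
```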

Table 1 Details of training and validation sets

Test set Two datasets were used to test the proposed model: an internal institutional dataset (KMC, Manipal) and CHAOS, a public dataset [27]. The KMC, Manipal dataset consists of ten CT volumes, of which six have all four phases (plain, arterial, PV and delayed) and the remaining four have three phases (plain, arterial, PV), as can be seen in Table 4. The dataset has cases with different abnormalities, viz. metastases, cysts and hepatocellular carcinoma.

The CHAOS dataset comprised twenty CT training volumes (labeled 1, 2, 5, 6, 8, 10, 14, 16, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29 and 30 in the database) acquired in the PV phase. We chose this dataset for evaluation firstly because its CT images are in DICOM format, the standard format for medical images. Secondly, the images were acquired using three different scanners, namely a Philips SecuraCT with 16 detectors, a Philips Mx8000 CT with 64 detectors and a Toshiba AquilionOne with 320 detectors; hence, the robustness of the model could be assessed. Thirdly, the dataset has challenging cases, as discussed in the following sections. It is to be noted that none of the test set images were part of the training set. An overview of the test datasets is provided in Table 2.

Table 2 Details of test datasets

The ground truths for the KMC training, validation and test data were generated using the ITK-SNAP tool [28] under the guidance of an experienced radiologist (co-author) with over twenty years of expertise in medical imaging. The ground truths for the three public datasets are available in their respective databases. The KMC dataset will hereafter be referred to as the institutional dataset.

Preprocessing

The following operations were performed on the training/validation datasets before training: (i) the pixel intensities were first converted to Hounsfield units using the linear transformation described in the DICOM documentation, (ii) the images were converted to unsigned 8-bit integer format (Fig. 1b), (iii) the background pixels in the upper region of the image were removed through cropping, to reduce the unwanted areas and magnify the abdominal region, and (iv) the images were resampled to 384 × 384 × 3 pixels to satisfy the RGB input requirement of the DL model (Fig. 1c). The image dimension was chosen as a tradeoff between image quality and training time: higher-dimension training images gave better results at the cost of longer training time, whereas lower-dimension images reduced training time but produced inferior results. The preprocessed training and validation images were saved in .mat format. For the test sets, all the preprocessing operations except step (iii) were performed. It is to be noted that all the above preprocessing steps were automated.
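A minimal MATLAB sketch of steps (i)-(iv) for a single DICOM slice is given below; the crop margin is an illustrative assumption, since the exact automated cropping rule is not reproduced here.

```matlab
% Sketch of the preprocessing chain for one DICOM slice; dcmInfo is
% the dicominfo header loaded earlier.
raw = double(dicomread(dcmInfo));
hu  = raw * dcmInfo.RescaleSlope + dcmInfo.RescaleIntercept; % (i) Hounsfield units

img8 = im2uint8(rescale(hu));      % (ii) unsigned 8-bit integer format

topMargin = 60;                    % (iii) illustrative crop offset
img8 = img8(topMargin:end, :);     %       remove background above the abdomen

img = imresize(img8, [384 384]);   % (iv) resample to 384 x 384
img = repmat(img, [1 1 3]);        %      replicate to 3 channels (RGB input)
```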

Fig. 1

Overview of the preprocessing steps. a Raw CT axial slice. b CT slice in unsigned 8-bit integer format after rescaling. c CT slice after cropping and resizing

During training, augmentation techniques such as scaling and translation were applied to the training and validation datasets on the fly; hence, only the diversity of the data was increased and the dataset size remained the same. The scaling factor was randomly selected from the range 60–100% (horizontal) and 40–100% (vertical). For translation, the value was randomly chosen from [−25, 25] pixels (horizontal) and [−5, 5] pixels (vertical). The chosen ranges ensured that the abdominal region remained sufficiently intact in the image after augmentation.
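In MATLAB, these ranges can be expressed with an imageDataAugmenter applied through the training datastore; a sketch is given below, where imds and pxds denote assumed image and pixel-label datastores.

```matlab
% On-the-fly augmentation with the stated ranges; imds and pxds are
% assumed datastores for the training images and liver masks.
augmenter = imageDataAugmenter( ...
    'RandXScale',       [0.60 1.00], ...  % horizontal scaling 60-100%
    'RandYScale',       [0.40 1.00], ...  % vertical scaling 40-100%
    'RandXTranslation', [-25 25], ...     % horizontal shift in pixels
    'RandYTranslation', [-5 5]);          % vertical shift in pixels

% The augmenter acts per minibatch, so the stored dataset size is
% unchanged; only the effective diversity of the data increases.
trainDs = pixelLabelImageDatastore(imds, pxds, ...
    'DataAugmentation', augmenter);
```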

Proposed Framework

The proposed SegNet-based framework for liver segmentation consists of three parts: an encoder, an ASPP module and a decoder (Fig. 2). The model has four encoder-decoder pairs, each consisting of convolution (conv.), Batch Normalization (BN) and leaky ReLU layers. The max unpooling layer in the decoder performs non-linear upsampling of the input feature maps using the pooling indices derived at the corresponding encoder's max-pooling layer. The encoder layers are initialized with weights from the VGG-16 network [29] trained on the ImageNet database. The kernel sizes of the convolutional and max-pooling layers are 3 × 3 and 2 × 2, with strides of 1 and 2, respectively. Cross entropy was the loss function employed during training. Table 3 gives the details of the model architecture.
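The sketch below illustrates how one encoder-decoder pair with index-preserving pooling can be assembled in MATLAB; the filter count numF and the layer names are illustrative assumptions, not the exact values of Table 3.

```matlab
% One encoder block and its paired decoder stage; the full model has
% four such pairs, with the ASPP module between them.
numF = 64;
encBlock = [
    convolution2dLayer(3, numF, 'Padding', 'same', 'Name', 'enc1_conv')
    batchNormalizationLayer('Name', 'enc1_bn')
    leakyReluLayer(0.01, 'Name', 'enc1_lrelu')
    maxPooling2dLayer(2, 'Stride', 2, 'Name', 'enc1_pool', ...
                      'HasUnpoolingOutputs', true)];  % exports pooling indices

decBlock = [
    maxUnpooling2dLayer('Name', 'dec1_unpool')  % consumes enc1_pool indices
    convolution2dLayer(3, numF, 'Padding', 'same', 'Name', 'dec1_conv')
    batchNormalizationLayer('Name', 'dec1_bn')
    leakyReluLayer(0.01, 'Name', 'dec1_lrelu')];

% In the assembled layerGraph, the 'indices' and 'size' outputs of
% enc1_pool are connected to the corresponding inputs of dec1_unpool.
```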

Fig. 2

Proposed framework. a Liver segmentation module consisting of four encoder-decoder blocks and ASPP module. b Detailed structure of the ASPP module

Table 3 Detailed architecture of the proposed model

To overcome the dying ReLU problem sometimes encountered with ReLU layers, we employed leaky ReLU layers [30] in the encoder and decoder blocks. A leaky ReLU is an activation function that multiplies any input less than zero by a fixed scalar. It is mathematically defined as follows:

$$f\left(x\right)=\begin{cases} scale\cdot x, & x<0\\ x, & x\ge 0\end{cases}$$
(1)

where x is the input value and scale is the scalar, chosen as 0.01 in this work.
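For illustration, Eq. (1) can be written out directly, alongside the built-in layer form used when assembling the network; the numeric examples merely show the behavior.

```matlab
% Eq. (1) as an anonymous function, with scale = 0.01 as in this work.
scale = 0.01;
leaky = @(x) max(x, 0) + scale * min(x, 0);

leaky(-3)   % ans = -0.03: negative inputs keep a small, nonzero slope
leaky(2)    % ans =  2:    positive inputs pass through unchanged

layer = leakyReluLayer(scale);   % the equivalent layer form
```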

The ASPP module performs four convolutions in parallel: one with a 1 × 1 filter and three atrous convolutions (with dilation rates of 6, 12 and 18) with 3 × 3 filters (Fig. 2b). Each of these layers is followed by BN and ReLU layers; finally, a concatenation layer fuses the four outputs. The multiple receptive fields incorporated in the ASPP module aid in viewing the abdominal CT image at different scales [31]. Unlike SegNet, which further downsamples the feature map (to 12 × 12), resulting in information loss, the proposed model retrieves useful multiscale context information from the feature map (of resolution 24 × 24 pixels) after the fourth encoder (shown in Table 3).

By increasing the receptive field of the convolution filters, more context information can be extracted, which aids in identifying the liver region more precisely. Different receptive fields enable the model to learn different features pertaining to the region of interest, such as neighboring structures, location and size with respect to other organs, and many inherent patterns that are beyond human perception. However, enlarging the receptive field with standard convolutions increases the number of learnable parameters (weights and biases), and thus the network complexity and training time, both crucial factors when training DL models. A solution is to use multiple atrous convolutions instead. For a k × k kernel with dilation rate d, the effective receptive field is k + (k − 1)(d − 1); hence, dilation rates of 6, 12 and 18 for a 3 × 3 kernel expand the receptive field to 13 × 13, 25 × 25 and 37 × 37, respectively. The same pixel can thus be viewed with respect to 168, 624 and 1368 surrounding pixels, extracting more information without increasing the learnable parameters. The ASPP module was therefore incorporated to improve liver segmentation accuracy at no extra parameter cost.
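A sketch of how the ASPP branch of Fig. 2b can be assembled as a MATLAB layer graph is given below; the filter count and layer names are illustrative assumptions.

```matlab
% ASPP module: one 1x1 convolution and three dilated 3x3 convolutions
% in parallel, each followed by BN and ReLU, fused by concatenation.
numF  = 256;        % illustrative filter count
rates = [6 12 18];  % dilation rates from the paper

lg = layerGraph();
lg = addLayers(lg, [convolution2dLayer(1, numF, 'Name', 'aspp_1x1')
                    batchNormalizationLayer('Name', 'bn_1x1')
                    reluLayer('Name', 'relu_1x1')]);
for r = rates
    n = sprintf('aspp_d%d', r);
    lg = addLayers(lg, [convolution2dLayer(3, numF, 'DilationFactor', r, ...
                            'Padding', 'same', 'Name', n)
                        batchNormalizationLayer('Name', ['bn_' n])
                        reluLayer('Name', ['relu_' n])]);
end
lg = addLayers(lg, depthConcatenationLayer(4, 'Name', 'aspp_concat'));

% The fourth-encoder output feeds all four branches; the branch outputs
% (relu_1x1, relu_aspp_d6, relu_aspp_d12, relu_aspp_d18) are then
% connected to aspp_concat/in1 ... in4, which feeds the first decoder.
```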

Evaluation Metrics

Five standard metrics, namely Dice Coefficient (DC), Jaccard Index (JI), Matthews Correlation Coefficient (MCC), Absolute Volume Difference (AVD) and Average Symmetric Surface Distance (ASD), were used to evaluate the segmentation accuracy. DC and JI compute the percentage of overlap between the segmented and ground truth volumes [32]. MCC measures the quality of a binary classification and is appropriate even when the pixels in the two classes are imbalanced [33]. AVD computes the absolute difference between the segmented and ground truth volumes, and ASD gives the average distance between the surfaces of the two volumes in mm. Higher values indicate better performance for DC, JI and MCC; the opposite holds for AVD and ASD.
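For the overlap-based metrics, a minimal MATLAB sketch is shown below; pred and gt are assumed binary masks, dice and jaccard are Image Processing Toolbox functions, and MCC is computed from the confusion counts.

```matlab
% Overlap metrics for one predicted mask against its ground truth;
% pred and gt are logical masks of equal size.
dc = dice(pred, gt);      % Dice coefficient
ji = jaccard(pred, gt);   % Jaccard index

tp = nnz( pred &  gt);  fp = nnz( pred & ~gt);
fn = nnz(~pred &  gt);  tn = nnz(~pred & ~gt);
mcc = (tp*tn - fp*fn) / ...
      sqrt((tp+fp) * (tp+fn) * (tn+fp) * (tn+fn));  % Matthews correlation
```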

Experimental Results

The programs were implemented in MATLAB R2021a, and the DL models were trained on a server with an NVIDIA T4 GPU with 16 GB memory. The trained models were evaluated on the test sets using a laptop with an Intel Core i7-10750H processor, 16 GB DDR4 RAM and the Windows 10 operating system.

Parameter Setting

The model was trained with the Stochastic Gradient Descent with Momentum (SGDM) optimizer, with a momentum of 0.7 and a minibatch size of 2. The initial learning rate was 0.1 and was lowered by a factor of 0.1 every 50 epochs. The proposed model was trained for eight different epoch counts, viz. 50, 54, 58, 60, 90, 110, 130 and 150, to find the optimal one; the DC for the PV phase of the institutional dataset was computed for each (Fig. 3). Since 150 epochs gave the highest DC, this value was adopted.
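These settings map directly onto MATLAB's trainingOptions; the sketch below mirrors the stated hyperparameters and omits validation and logging options for brevity.

```matlab
% Training configuration as stated above.
options = trainingOptions('sgdm', ...
    'Momentum',            0.7, ...
    'MiniBatchSize',       2, ...
    'InitialLearnRate',    0.1, ...
    'LearnRateSchedule',   'piecewise', ...
    'LearnRateDropFactor', 0.1, ...
    'LearnRateDropPeriod', 50, ...
    'MaxEpochs',           150);

% trainDs: augmented datastore from the preprocessing section;
% lgraph: the proposed layer graph (assumed assembled as sketched above).
net = trainNetwork(trainDs, lgraph, options);
```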

Fig. 3

Dice coefficient for portal venous phase of institutional dataset

Liver Segmentation Results

Tables 4 and 5 show the liver segmentation results for the two test sets. From Table 4, we can see that the best DC achieved was 97.01% (PV phase) and the poorest was 87.3% (plain phase). The average DC values obtained for the four CT phases are 96.12% (PV), 94.61% (arterial), 95.01% (delayed) and 93.23% (plain). For the CHAOS dataset, a DC greater than 97% was achieved for the majority of cases, with an average DC of 96.69%.

Table 4 Quantitative results for internal institutional dataset
Table 5 Quantitative results for CHAOS dataset

The liver segmentation results for some of the cases from the two datasets are illustrated in Figs. 4 and 5. The first row of Fig. 4 shows a grainy case in which the liver has an unusual shape and a large peripheral tumor; although the contrast between the liver and adjacent structures such as the rib muscles is low in the arterial and plain phases, the liver is segmented well. The second row shows a case where most of the liver contains a heterogeneous tumor and the contrast between the liver and heart is very low in all phases except the arterial. In both cases, the liver is segmented quite accurately in all phases. The differences between the ground truths and predicted masks in Figs. 4 and 5 arise mainly from slight discrepancies in the contouring of the liver boundary and the Inferior Vena Cava (IVC).

Fig. 4

Liver segmentation results for multiphase CT images. a Portal venous. b Arterial. c Delayed. d Plain (Ground truth: red, predicted: green) (Note: The images have been cropped for better visualization of the liver and the contours)

Fig. 5

Liver segmentation results for CHAOS dataset. a Axial CT slice. b Ground truth mask. c Predicted liver mask. d Contours marked on CT image (Ground truth: red, predicted: green)

The training and validation accuracy and loss curves for the proposed model are shown in Fig. 6. The validation curves exhibited large fluctuations until the fifty-second epoch, after which the variations were reduced.

Fig. 6

Learning curves of the proposed model. a Accuracy. b Loss

Ablation Study

An ablation study was conducted to validate the necessity of the different components of the proposed model. Three DL models were studied: model 1 (the original SegNet with five encoder-decoder pairs), model 2 (SegNet with four encoder-decoder pairs) and model 3 (model 2 with an ASPP block inserted between encoder 4 and decoder 4). Tables 6 and 7 summarize the results obtained for the institutional and CHAOS datasets, respectively, and Table 8 gives the network complexity of the different models.

Table 6 Quantitative results of ablation study on the institutional test dataset
Table 7 Quantitative results of ablation study on the CHAOS dataset
Table 8 Comparison of network complexity

Comparison with Model 1 It can be observed from Table 6 that the proposed model outperformed the original SegNet (model 1) for all CT phases and all metrics, barring AVD for the arterial phase and ASD for the plain phase (institutional test set). Table 7 shows that, for the CHAOS dataset, the proposed model performed better on all metrics. Table 8 shows that the proposed model is also superior with respect to learnable parameters and training time, by approximately 42% and 5 h, respectively. These results show that replacing the fifth encoder-decoder pair in the original SegNet with the ASPP block and employing leaky ReLU layers improved the liver segmentation accuracy while reducing the network parameters and training time. Hence, the proposed model is superior to SegNet (model 1).

Comparison with Model 2 To examine the usefulness of the fifth encoder-decoder block in model 1, we removed it and developed model 2. The number of learnable parameters and the training time were reduced considerably (Table 8). However, the segmentation results in the tables clearly show that model 1 is better than model 2 for most of the phases. Thus, the fifth encoder-decoder block is critical for SegNet.

Compared to the proposed model, model 2 required 1.9 million fewer learnable parameters. However, the proposed model gave better segmentation results. For the PV, arterial, delayed and plain CT phases of the institutional test set, DC increased by 1.45%, 3.05%, 0.5% and 5.02%, respectively, and JI improved by 2.46%, 4.95%, 0.9% and 7.98% for the four phases. The ASD values were better for the proposed model by 7.27 mm, 8.49 mm, 0.41 mm and 4.37 mm for the four phases in the same order. MCC improved by 1.4%, 2.9%, 0.52% and 4.89%, and AVD by 2.73%, 6.42%, 0.91% and 6.64%, respectively. For the CHAOS dataset, the improvements were 0.46%, 0.78%, 0.47%, 0.86% and 1.26 mm for DC, JI, MCC, AVD and ASD, respectively. These results indicate that removing the fifth encoder-decoder pair reduces the learnable parameters somewhat but deteriorates the segmentation outcomes. Thus, a four encoder-decoder SegNet is not as effective as our proposed model for liver segmentation.

Comparison with Model 3 In an attempt to achieve high accuracy with fewer learnable parameters, we investigated model 3, in which an ASPP module was inserted after the fourth encoder block. Model 3 outperformed model 2 in accuracy; however, it required 3.5 h more for training. Compared to model 1, model 3 gave better results barring ASD for the arterial phase (Table 6). For the CHAOS dataset, model 1 gave better results than model 3 for all metrics except ASD (Table 7); however, model 3 took 1.5 h less to train.

The proposed model performed better than model 3 for all metrics except AVD for the PV phase; for the arterial phase, ASD was better by 1.15 mm (Table 6). For the CHAOS dataset, the proposed model gave better results for all metrics. In addition, although both models had the same number of learnable parameters (17.1 M), the proposed model needed less training time. It is noted from Table 6 that model 3 performed slightly better than the proposed model for some phases of the institutional dataset. Nevertheless, the better results for the PV phase of the institutional dataset and for the challenging CHAOS dataset, together with the lower training time, make the proposed model superior to model 3.

The above results emphasize that replacing the fifth encoder-decoder pair with the ASPP block and using leaky ReLU instead of ReLU layers enhanced the performance of the original SegNet in terms of accuracy, computational complexity, training time and generalizability.

Comparison with Other DL Models

A comparative analysis with other widely used semantic segmentation networks, namely UNet, DeepLab v3+ and SegNet, was performed, and the results are reported in Tables 9, 10 and 11. The proposed model outperformed all these DL models in all CT phases except for AVD in the arterial phase and ASD in the plain phase.

Table 9 Comparison of other DL models on the internal test dataset
Table 10 Comparison of other DL models on the CHAOS dataset
Table 11 Comparison of network complexity

UNet gave the poorest results for most CT phases (Table 9). Compared to UNet, the DC of our model was higher by 1.6%, 2.08%, 1.6% and 4.16% for the PV, arterial, delayed and plain phases, respectively. JI improved by 2.81%, 3.62%, 2.8% and 6.87%, and MCC by 1.53%, 2.06%, 1.64% and 4.26% for the four phases. ASD improved by 4.7 mm, 3.09 mm and 0.23 mm for the PV, arterial and delayed phases. For the CHAOS dataset, the proposed model was better by 0.5%, 0.89%, 0.55%, 0.71% and 0.3 mm for DC, JI, MCC, AVD and ASD, respectively (Table 10).

Compared to DeepLab v3+, the DC of our model was higher by 0.7%, 0.56%, 0.66% and 0.78% for the PV, arterial, delayed and plain CT phases, respectively, and JI improved by 1.28%, 1.02%, 1.19% and 1.35% for the four phases. ASD was better by 2.98 mm, 1.61 mm and 0.34 mm for the PV, arterial and delayed phases, respectively. For the CHAOS dataset, the proposed model gave the best results and DeepLab v3+ the poorest (Table 10); the improvements in DC, JI, MCC, AVD and ASD were 3.47%, 3.95%, 3.01%, 3.87% and 0.77 mm, respectively. The comparison of the proposed model with SegNet has already been discussed in the previous subsection.

From Table 11, it can be observed that the proposed model required the fewest learnable parameters (17,115,718, i.e. ~17.1 million). The learnable parameters comprise the weights and biases of the convolutional layers and the offset and scale of the batch normalization layers; the other layers have no learnable parameters. The layer-wise details are given in Table 12. Compared to the SegNet, UNet and DeepLab v3+ models, our proposed model requires approximately 42%, 86% and 61% fewer learnable parameters, respectively. UNet required the minimum training time (48.5 h); our model required 3.5 h more but produced better segmentation results. Compared to the remaining models, our model was superior in terms of both learnable parameters and segmentation accuracy. Hence, we conclude that our model is the best considering all aspects.
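For reference, such a tally can be reproduced by summing exactly these properties over the network layers; the sketch below assumes net is the trained network object.

```matlab
% Count learnable parameters: conv Weights/Bias and BN Offset/Scale.
total = 0;
props = {'Weights', 'Bias', 'Offset', 'Scale'};
for i = 1:numel(net.Layers)
    for j = 1:numel(props)
        if isprop(net.Layers(i), props{j})
            total = total + numel(net.Layers(i).(props{j}));
        end
    end
end
fprintf('Learnable parameters: %d\n', total);
```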

Table 12 Layer wise details of the learnable parameters of the proposed model

It was observed that all the DL models segmented the simple cases equally well; however, in the majority of the challenging cases, the proposed model outperformed the others. The liver segmentation results for some unusual cases from the CHAOS dataset are presented in Fig. 7. The first column shows the results for case 14, where the liver has varying intensity and nonuniform texture due to contrast injection. SegNet and UNet exhibited over- and under-segmentation, and DeepLab v3+ gave abysmal results for all slices in the volume; however, apart from the inclusion of the IVC, our model delineated the liver quite precisely. For case 6, the shape of the liver is atypical and its boundary with the spleen is vague; it is apparent from Fig. 7c (second column) that the proposed model segmented the liver quite successfully compared to the other models. For cases 23 and 28, the other models incorrectly segmented parts of the spleen as liver, producing False Positives (FP), whereas the proposed model segmented only the liver regions, including both lobes for case 28. In case 25, an unusual liver shape is delineated best by the proposed model. This analysis implies that our model is a clear improvement over the well-known DL models.

Fig. 7

Comparison of liver segmentation results of different DL models. a Input axial CT image. b Ground truth. c Proposed method. d SegNet. e DeepLab v3+. f UNet. g Contours marked on CT image (Ground truth: red, Proposed: cyan, SegNet: blue, DeepLab v3+: yellow, UNet: magenta)

Figure 8 shows the segmentation results obtained for unhealthy (first and second columns) and healthy livers (third and fourth columns). In the first and second columns, tumors are present near the border of the liver; the liver has nevertheless been segmented well in both cases. The third column shows a liver with two lobes, which has been segmented accurately. The fourth column shows a healthy liver with the heart and other structures of similar intensity in the vicinity, also segmented precisely by the proposed model.

Fig. 8

Segmentation results for unhealthy and healthy liver (Red and yellow contours indicate ground truth and predicted output, respectively)

Discussion

The observations made from our study are outlined in this section. Most of the literature on semantic segmentation of the liver has focused on UNet; SegNet-based architectures are rarely used. This trend may be because UNet was initially developed for medical image understanding and segmentation, whereas SegNet was primarily used for road scene segmentation. Our findings suggest that SegNet and the proposed SegNet-based model delineate the liver more accurately than UNet for all CT phases.

The ablation studies highlighted that integrating the ASPP scheme into the four encoder-decoder SegNet model improves the segmentation accuracy. Although the DeepLab v3+ network also uses this scheme, our investigations reveal that the proposed model is more efficient, effective and robust (Tables 9, 10 and 11), requiring 61% fewer learnable parameters and comparatively less training time.

The ablation studies also indicate that the leaky ReLU layers in the encoder and decoder sections made the model more robust. The liver segmentation results for case 14 (with a hyperdense liver) of the CHAOS dataset are depicted in Fig. 9. The results illustrate that model 3 (with ReLU layers) could not segment the liver as effectively as the proposed model (Fig. 9c, d). Although both models gave similar results for most cases, the results for case 14 illustrate that the proposed network is more robust.

Fig. 9

Comparison of liver segmentation results of model 3 and the proposed model for case 14. a Input axial slice. b Ground truth. c Proposed model. d Model 3. e Liver contours marked on CT image (Ground truth: Red, Proposed: Green, Model 3: Blue)

Our model was trained on CT images from three databases, namely 3D-IRCADb, LiTS and our institutional database. We tested it on two test sets, (a) the CHAOS dataset and (b) our institutional dataset, and achieved satisfactory results on both. It is to be noted that no images from the CHAOS database were included in the training set and that the test data from the institutional database were separate from the training images drawn from the same database. Since the images in the CHAOS dataset were acquired using different scanners, as mentioned earlier, the good results on this dataset demonstrate the robustness of our model. Moreover, it was effective in segmenting the liver from multiple CT phases (plain, arterial, PV and delayed) although it was mainly trained on PV images. Hence, our model has the potential to be integrated into a CADx system.

Table 13 compares the proposed method with other recent works that employed the CHAOS dataset. The metrics compared were DC and JI; the former was specified in all the works, whereas the latter was specified in only a few. The proposed method has given better results than these works. Mulay et al. [34] presented a method based on Holistically-nested Edge Detection and a Region-Convolutional Neural Network for liver segmentation; their approach required the images to be enhanced through adaptive histogram equalization and a sigmoid function, and obtained a DC of 94%. Lei et al. [35] proposed a U-shaped network that employed improved pooling operations and skip connections and achieved a DC of 95.58%. Khan et al. [36] integrated UNet, residual networks, dilated convolutions and a new loss function to segment the liver and reported a DC and JI of 95.49% and 89.13%, respectively. Wu et al. [22] developed a CNN based on UNet, multiscale processing and an attention mechanism and obtained a DC of 96.12%; this is only slightly lower (by 0.57%) than that of the proposed method (DC = 96.69%), but their JI was lower by 0.93% and their model required 24.97 million parameters, whereas the proposed method used only around 17 million. Since our model performed better than the other works in terms of DC, JI and other parameters, we conclude that it is superior.

Table 13 Comparison of the proposed method with other works that have used CHAOS dataset

To sum up, the advantages of the proposed model compared to the other architectures are that (i) it delineates the liver more precisely in all CT phases; (ii) it is more robust, as complex and uncommon cases, especially in the CHAOS dataset (livers with unconventional shapes and heterogeneous or hyperdense intensity distributions), were segmented comparatively better, even though the model was trained mainly on PV-phase images and tested on all four CT phases; (iii) it produces fewer FPs, which can adversely affect the diagnosis made by CADx systems; (iv) it outperforms other state-of-the-art methods that employed the same dataset (CHAOS); and (v) it is simple to implement, as it is built from existing components. The key limitation of the model is that it did not identify the IVC in many cases. Another shortcoming is its sensitivity to the CT image format: although it was trained using CT images in both DICOM and NIfTI formats, the algorithm works better on the former and gives inferior results on the latter.

Conclusion

This study developed a DL model for liver segmentation from multiphase abdominal CT volumes. The network was trained on CT images from different databases and tested on two diverse datasets: an institutional multiphase CT dataset and a public dataset. The experimental results of a comparative study indicate that the proposed model is superior to some commonly employed DL models, performing well in terms of accuracy, learnable parameters and training time. Hence, we believe that our liver segmentation algorithm is suitable for incorporation into a CADx system. Future work includes constructing a hepatic CADx system for differentiating between normal and abnormal livers and diagnosing liver cancer.