1 Introduction

Liver cancer is the second most common cause of cancer-related deaths worldwide among men, and the sixth among women [1]. Radiological examinations, such as computed tomography (CT) and magnetic resonance imaging (MRI), are the primary methods of detecting liver tumors. Computer-aided diagnosis (CAD) systems play an important role in the early and accurate detection and classification of focal liver lesions (FLLs).

Currently, multi-phase CT images, also known as dynamic CT images, are widely used to detect, locate, and diagnose focal liver lesions. A multi-phase CT scan is generally divided into four phases (i.e., the non-contrast, arterial, portal venous, and delay phases). Between the non-contrast phase and the delay phase, the vascularity and contrast-enhancement patterns of liver masses can be assessed. We observe that, when human experts diagnose the type of an FLL, they tend to zoom the CT images in and out to examine the details of the lesion [2], and they also look back and forth across the different phases. This observation highlights the importance of combining local with global information and of the temporal enhancement pattern.

Several published studies have reported on the characterization of FLLs using multi-phase images to capture the temporal information among phases. Roy et al. [3] proposed a framework that extracts spatiotemporal features from multi-phase CT volumes for the characterization of FLLs. In addition to conventional density features (the normalized average intensity of a lesion) and texture features (the gray-level co-occurrence matrix [GLCM]), temporal density and texture features (the intensity and texture enhancement over the three enhancement phases compared with the non-contrast phase) were employed. Compared with such low-level features, mid-level features such as bag-of-visual-words (BoVW) and its variants have proven to be considerably more effective for classifying FLLs [4,5,6,7,8,9]. In most BoVW-based methods, the histograms of each phase are extracted separately and then either concatenated into a spatiotemporal feature [5, 8, 9] or averaged over the phases to represent the multi-phase images [4]. These approaches ignore the temporal enhancement information and the relationships among phases.

In recent years, the high-level feature representations of deep convolutional neural networks (DCNNs) have proven to be superior to hand-crafted low-level and mid-level features [10]. Deep learning techniques have also been applied to medical image analysis and computer-aided detection and diagnosis. However, there have been very few studies on the classification of focal liver lesions. Frid-Adar et al. [11] proposed a multi-scale patch-based classification framework to detect focal liver lesions. Yasaka et al. [12] proposed a convolutional neural network with three channels corresponding to three phases (NC, ART, and DL) for the classification of liver tumors in dynamic contrast-enhanced CT images. The method can extract high-level temporal and spatial features, resulting in higher classification accuracy than the state-of-the-art methods. Its limitation is that it lacks information on image enhancement patterns.

In this paper, we propose a deep-learning-based framework, called ResGL-BDLSTM, which combines a residual network (ResNet) with global and local pathways (ResGLNet) [13] and a bi-directional long short-term memory (BD-LSTM) model for the classification of focal liver lesions. The main contributions are summarized as follows:

  1. We extract features from each single-phase CT image via the ResGLNet. The input of the ResGLNet is a (patch, ROI) pair that represents the local and global information, respectively, to handle inter-class similarities.

  2. We extract the enhancement pattern hidden in the multi-phase CT images via the BD-LSTM block to represent each patch. To the best of our knowledge, expressing temporal features (enhancement patterns) among multi-phase images using deep learning has not been investigated previously.

  3. We propose a new loss function to train our model, yielding a more robust and accurate deep model. The loss function is composed of an inter-loss and an intra-loss: the inter-loss keeps the features of different classes separable, while the intra-loss minimizes the intra-class variations, with the class centers updated during back-propagation.

2 Methodology

A flowchart of the proposed framework is shown in Fig. 1. The ResGLNet block, which extracts local and global information from each single phase, will be described in detail in Sect. 2.1. The BD-LSTM block, which extracts the enhancement pattern, will be described in detail in Sect. 2.2. The method combining the ResGLNet block and BD-LSTM block will be described in Sect. 2.3. We will introduce the loss function and training strategy of the framework in Sect. 2.4. In Sect. 2.5, we describe the features extracted from the label map, and how we accomplish the lesion-based classification.

Fig. 1. The flowchart of our framework.

2.1 ResGLNet

In this sub-section, we describe the ResGLNet block, which was proposed in our previous work [13]. The ResGLNet involves a local pathway and a global pathway, which intuitively extract local and global information, respectively. The employed ResGLNet is an extension of the ResNet proposed in [10]. We utilize three ResGLNet blocks, which have the same architecture but do not share weights, to extract information from the three respective phases. Each ResNet block contains 19 convolutional layers, one average-pooling layer, and one fully connected layer, and each convolutional layer is followed by a rectified linear unit (ReLU) activation function and a batch normalization layer.
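To make the dual-pathway design concrete, the following is a minimal sketch of how one ResGLNet block might be assembled with tf.keras. The filter counts, the number of residual units, and the fused feature dimension are illustrative assumptions rather than our exact configuration; the input sizes follow the patch and ROI sizes described in the pathway subsections below.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def residual_unit(x, filters):
    """A basic residual unit: two conv layers with a skip connection."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if shortcut.shape[-1] != filters:          # match channel dimensions
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))

def pathway(x, filters=(16, 32, 64)):
    """A small residual pathway ending in global average pooling."""
    for f in filters:
        x = residual_unit(x, f)
        x = layers.MaxPooling2D(2)(x)
    return layers.GlobalAveragePooling2D()(x)

def build_resgl_block(feature_dim=128, name="ResGLNet_block"):
    """Illustrative ResGLNet block: local patch + global ROI pathways."""
    patch_in = layers.Input(shape=(64, 64, 1), name=name + "_patch")
    roi_in = layers.Input(shape=(128, 128, 1), name=name + "_roi")
    local_feat = pathway(patch_in)
    global_feat = pathway(roi_in)
    fused = layers.Concatenate()([local_feat, global_feat])
    fused = layers.Dense(feature_dim, activation="relu")(fused)
    return Model([patch_in, roi_in], fused, name=name)
```

Three such blocks, one per phase and each with its own weights, are then instantiated and applied to the corresponding phase images.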

Global Pathway.

First, we applied a random walk-based interactive segmentation algorithm [14] to segment healthy tissue and focal liver lesions. The segmented results were checked by two experienced radiologists, and the segmentation was performed separately for each phase image. During a clinical CT study, the spatial position of tissues in the different phases exhibits some misalignment, owing to differences in the patient's body position, respiratory movements, and the heartbeat. Therefore, to obtain the true variation of density over the phases, a non-rigid registration technique was applied to localize a reference lesion in the other phases [15]. Each segmented lesion image (i.e., 2D slice image) was resized to \( 128 \times 128 \), and the resized images were then used as input for global-pathway training and testing.

Local Pathway.

Patches were extracted from the ROIs. Each patch has a label \( c \in \left\{ {c_{0} ,c_{1} ,c_{2} ,c_{3} } \right\} \), where \( c_{0} \) represents a cyst, \( c_{1} \) a focal nodular hyperplasia (FNH), \( c_{2} \) a hepatocellular carcinoma (HCC), and \( c_{3} \) a hemangioma (HEM). Because the different lesions vary significantly in size, extreme imbalances occur among the patch categories. To address this problem, the pace value is derived in Eq. (1):

$$ pace_{i} = \left\{ \begin{array}{ll} floor\left( \sqrt{\dfrac{w_{i} * h_{i}}{\epsilon}} \right), & w_{i} * h_{i} > \epsilon \\ 1, & w_{i} * h_{i} \le \epsilon \end{array} \right. $$
(1)

where \( i \) indexes the ROI; \( pace_{i} \) is the pace used to extract patches from the i-th ROI; \( w_{i} \) and \( h_{i} \) are the width and height of the i-th ROI, respectively; \( \epsilon \) is a threshold that limits the number of patches; and the floor function denotes rounding down. For the testing dataset, the pace is always set to 1. As in the global pathway, we resized the patches to \( 64 \times 64 \).
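As an illustration, Eq. (1) and the pace-based patch extraction could be implemented as follows; the sliding-window loop and the function names are assumptions about how the pace is applied, not our exact implementation.

```python
import math

def compute_pace(width, height, epsilon=128):
    """Pace (stride) for patch extraction in one ROI, following Eq. (1)."""
    area = width * height
    if area > epsilon:
        return math.floor(math.sqrt(area / epsilon))
    return 1

def extract_patches(roi, patch_size, epsilon=128):
    """Slide a patch_size window over a 2D ROI array with the computed pace.

    At test time the pace is fixed to 1 instead.
    """
    h, w = roi.shape
    pace = compute_pace(w, h, epsilon)
    patches = []
    for y in range(0, h - patch_size + 1, pace):
        for x in range(0, w - patch_size + 1, pace):
            patches.append(roi[y:y + patch_size, x:x + patch_size])
    return patches
```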

2.2 BD-LSTM

A recurrent neural network (RNN) maintains a self-connected hidden state that acts as a memory of previous information when processing sequential data. Long short-term memory (LSTM) is a class of RNN that avoids the vanishing gradient problem.

Bi-directional LSTM (BD-LSTM), which stacks two layers of LSTM, is an extension of LSTM. The two LSTM layers, illustrated in Fig. 1, work in opposite directions to extract useful information from sequential data, and the enhancement information carried by the two layers is concatenated to form the output. One layer works in the forward (\( z \)) direction and extracts the enhancement pattern from the NC phase through the PV phase; the other works in the reverse (\( z^{+} \)) direction and extracts the anti-enhancement pattern from the PV phase back to the NC phase.

2.3 Combining ResGLNet and BD-LSTM

The motivation for combining the ResGLNet and the BD-LSTM to classify focal liver lesions is to treat the multi-phase CT images as sequential data. The ResGLNet extracts information from a single phase (i.e., intra-phase information), whereas the BD-LSTM distills enhancement information among the three phases (i.e., inter-phase information); the length of the sequence is fixed (i.e., three). The two blocks work in coordination as follows.

The outputs of the three ResGLNet blocks, taken as sequential data, constitute the input of the BD-LSTM. The output of the two LSTM layers (i.e., the BD-LSTM), which represents the patch, is then fed to a fully connected layer, and the softmax layer following the last fully connected layer produces the patch-based classification result.
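As a rough sketch, this coordination between the three ResGLNet blocks and the BD-LSTM could be wired as follows with tf.keras; the LSTM width, the feature dimension, and the way the per-phase features are stacked into a length-3 sequence are illustrative assumptions.

```python
from tensorflow.keras import layers, Model

NUM_PHASES = 3    # NC, ART, PV
NUM_CLASSES = 4   # cyst, FNH, HCC, HEM

def build_resgl_bdlstm(resgl_blocks, feature_dim=128, lstm_units=64):
    """Patch-level classifier: per-phase ResGLNet features -> BD-LSTM -> softmax.

    `resgl_blocks` is a list of three Keras models (one per phase, no weight
    sharing), each mapping a (patch, ROI) pair to a feature_dim-dim vector.
    """
    inputs, phase_feats = [], []
    for p, block in enumerate(resgl_blocks):
        patch_in = layers.Input(shape=(64, 64, 1), name=f"patch_phase{p}")
        roi_in = layers.Input(shape=(128, 128, 1), name=f"roi_phase{p}")
        inputs += [patch_in, roi_in]
        phase_feats.append(block([patch_in, roi_in]))

    # Stack the three per-phase feature vectors into a length-3 sequence.
    seq = layers.Concatenate(axis=1)(
        [layers.Reshape((1, feature_dim))(f) for f in phase_feats])

    # BD-LSTM: one direction follows the enhancement (NC -> PV), the other
    # the anti-enhancement (PV -> NC); the two outputs are concatenated.
    enc = layers.Bidirectional(layers.LSTM(lstm_units), merge_mode="concat")(seq)

    probs = layers.Dense(NUM_CLASSES, activation="softmax")(enc)
    return Model(inputs, probs, name="ResGL_BDLSTM")
```

Here, `resgl_blocks` could be three instances produced by the `build_resgl_block` sketch in Sect. 2.1 (each with a distinct name), one per phase.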

2.4 Training Strategy

Loss Function.

Let \( N \) be the batch size and \( \omega^{t} \) be the weights in the t-th (t = 1, 2, …, T) layer. We use \( \varvec{W} \) to denote the weights of the mainstream network (involving the three ResGLNet blocks and the BD-LSTM block), and \( \hat{\varvec{W}}_{\text{local}} \) and \( \hat{\varvec{W}}_{\text{global}} \) to denote the weights of the local and global pathways (involving the three ResGLNet blocks; the same below), respectively. Furthermore, \( p\left( j \mid x_{i} ; \varvec{W} \right) \) denotes the probability of the i-th patch belonging to the j-th class; \( p\left( j \mid x_{i} ; \hat{\varvec{W}}_{\text{global}} \right) \) and \( p\left( j \mid x_{i} ; \hat{\varvec{W}}_{\text{local}} \right) \) are defined similarly. The cross-entropy is defined as follows:

$$ {\mathcal{L}}_{last} = - \frac{1}{N}\sum\limits_{i = 1}^{N} \sum\limits_{j = 1}^{K} p\left( j \mid x_{i} ; \varvec{W} \right) \log \left( p\left( j \mid x_{i} ; \varvec{W} \right) \right) $$
(2)

\( {\mathcal{L}}_{local} \) and \( {\mathcal{L}}_{global} \) are defined analogously, using \( \hat{\varvec{W}}_{\text{local}} \) and \( \hat{\varvec{W}}_{\text{global}} \), respectively. The inter-loss is then defined as follows:

$$ {{\mathcal{L}}_{inter}} = \frac{1}{2} * {{\mathcal{L}}_{last}} + \frac{1}{4} * {{\mathcal{L}}_{local}} + \frac{1}{4} * {{\mathcal{L}}_{global}} $$
(3)

The definition of the intra-loss (i.e. center loss [16]) is as follows:

$$ {\mathcal{L}}_{intra} = \frac{1}{2}\sum\limits_{i = 1}^{N} \left\| \varvec{f}_{i} - \varvec{c}_{y_{i}} \right\|^{2} $$
(4)

Here \( \varvec{f}_{i} \) is the representative feature of the i-th patch, and \( \varvec{c}_{y_{i}} \) denotes the feature center of the \( y_{i} \)-th class. During training, \( \varvec{c}_{y_{i}} \) is updated through back-propagation. To accelerate training, we perform the update on each batch instead of on the entire training set; note that, in this case, some of the centers may not change. The centers are updated as follows:

$$ \begin{aligned} \varvec{c}_{j} = & \, \varvec{c}_{j} - \alpha \cdot \Delta \varvec{c}_{j} \\ = & \, \varvec{c}_{j} - \alpha \cdot \frac{\sum\nolimits_{i = 1}^{N} \delta\left( y_{i} = j \right)\left( \varvec{c}_{j} - \varvec{f}_{i} \right)}{1 + \sum\nolimits_{i = 1}^{N} \delta\left( y_{i} = j \right)} \end{aligned} $$
(5)

Here \( \delta\left( y_{i} = j \right) = 1 \) if \( y_{i} = j \) holds, and \( \delta\left( y_{i} = j \right) = 0 \) otherwise. The coefficient \( \alpha \), whose range is (0, 1), restricts the learning rate of the centers.

Finally, we adopt a joint loss that combines the intra-loss and the inter-loss to train the framework. The formulation of the optimized loss is given in Eq. (6).

$$ {\mathcal{L}} = {{\mathcal{L}}_{inter}} + \lambda {{\mathcal{L}}_{intra}} $$
(6)
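The following numpy sketch shows how the inter-loss (Eq. (3)), the intra-loss (Eq. (4)), the per-batch center update (Eq. (5)), and the joint loss (Eq. (6)) fit together. The function and variable names are hypothetical, and the cross-entropy terms are computed against the ground-truth class, which is the usual reading of Eq. (2).

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy; probs is (N, K), labels is (N,) integer classes."""
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels] + eps))

def inter_loss(p_last, p_local, p_global, labels):
    """Eq. (3): weighted sum of the three cross-entropy terms."""
    return (0.5 * cross_entropy(p_last, labels)
            + 0.25 * cross_entropy(p_local, labels)
            + 0.25 * cross_entropy(p_global, labels))

def intra_loss(features, centers, labels):
    """Eq. (4): center loss over the batch."""
    diff = features - centers[labels]
    return 0.5 * np.sum(diff ** 2)

def update_centers(features, centers, labels, alpha=0.2):
    """Eq. (5): per-batch center update; absent classes keep their value."""
    new_centers = centers.copy()
    for j in range(centers.shape[0]):
        mask = labels == j
        if not mask.any():
            continue  # this class does not appear in the batch
        delta = np.sum(centers[j] - features[mask], axis=0) / (1 + mask.sum())
        new_centers[j] = centers[j] - alpha * delta
    return new_centers

def joint_loss(p_last, p_local, p_global, features, centers, labels, lam=0.1):
    """Eq. (6): inter-loss plus lambda times intra-loss."""
    return (inter_loss(p_last, p_local, p_global, labels)
            + lam * intra_loss(features, centers, labels))
```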

Training Process.

Our framework operates in two stages: a training stage and a testing stage. During training, we first trained the deep-learning components of the framework and then aggregated the label maps. The classification performance on the patches of the validation dataset determined when training stopped. After the model was trained, we aggregated the label maps of the training and validation datasets and used them as input to a support vector machine (SVM) classifier. The parameters of the SVM (the lesion-level classifier) were likewise selected according to its performance on the label maps of the validation dataset.

2.5 Post-processing of Label Map and Classification of Lesions

After training, we aggregated the label map of each lesion. Then, we extracted features from the label map. The features are as follows:

$$ feature_{i} = \left\{ {\beta_{i0} ,\beta_{i1} ,\beta_{i2} ,\beta_{i3} } \right\} $$
(7)

Here \( feature_{i} \) represents the feature vector of the i-th label map, and \( \beta_{ij} \), derived in Eq. (8), denotes the proportion of pixels belonging to the j-th category in the i-th label map. We then use an SVM to achieve the lesion-based classification.

$$ \beta_{ij} = \frac{\text{number of pixels belonging to the } j\text{-th category}}{\text{total number of pixels in the } i\text{-th label map}} $$
(8)
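A possible implementation of the label-map features of Eqs. (7) and (8) and the subsequent lesion-level SVM, using numpy and scikit-learn, is sketched below; the RBF kernel is an assumption, with hyper-parameters to be chosen on the validation set as described in Sect. 2.4.

```python
import numpy as np
from sklearn.svm import SVC

NUM_CLASSES = 4  # cyst, FNH, HCC, HEM

def label_map_feature(label_map, num_classes=NUM_CLASSES):
    """Eqs. (7)-(8): per-class pixel proportions of one lesion's label map."""
    counts = np.bincount(label_map.ravel(), minlength=num_classes)
    return counts / label_map.size

def train_lesion_classifier(train_maps, train_labels):
    """Fit the lesion-level SVM on one 4-D feature vector per lesion."""
    features = np.stack([label_map_feature(m) for m in train_maps])
    clf = SVC(kernel="rbf")  # kernel/hyper-parameters selected on validation
    clf.fit(features, train_labels)
    return clf
```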

3 Experiments

3.1 Data and Implementation

A total of 480 CT liver slice images were used, containing four types of lesions confirmed by pathologists (i.e., cyst, HEM, FNH, and HCC). The distribution of our dataset is shown in Table 1. The CT images are abdominal CT scans acquired from 2015 through 2017 with a slice collimation of 5–7 mm, a matrix of \( 512 \times 512 \) pixels, and an in-plane resolution of 0.57–0.89 mm. In our experiments, we randomly split the dataset into training, validation, and testing datasets. To reduce the effect of randomness, we performed the partitioning twice, forming two groups of datasets.

Table 1. The distribution of the dataset.

Our framework was implemented using the TensorFlow library. We initialized the parameters with a Gaussian distribution and used a momentum optimizer, with the learning rate initialized to 0.01 and the momentum coefficient set to 0.9. The batch size was set to 100. The parameters of our algorithm were \( \lambda = 0.1 \), \( \alpha = 0.2 \), \( \epsilon = 128 \), and patch size = 7.
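For reference, a corresponding optimizer and hyper-parameter setup in tf.keras might look as follows; our actual implementation may use a different TensorFlow API version.

```python
import tensorflow as tf

# Momentum SGD with the hyper-parameters reported above.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

BATCH_SIZE = 100
LAMBDA = 0.1    # weight of the intra-loss in Eq. (6)
ALPHA = 0.2     # learning rate of the class centers in Eq. (5)
EPSILON = 128   # ROI-area threshold in Eq. (1)
```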

3.2 Results

To validate the effectiveness of our proposed method, we compared our results with state-of-the-art methods based on low-level features [2], mid-level features [4,5,6,7,8], and CNNs with local information [10] and global information [11]. We also compared variants of our proposed method with different architectures: ResNet with local patches (w/o intra-loss), ResGLNet [13], ResGL-BDLSTM (w/o intra-loss), and ResGL-BDLSTM (with intra-loss). The comparison results (classification accuracy) are summarized in Table 2. Our proposed methods outperformed the state-of-the-art methods [3, 5,6,7,8,9, 11, 12]. The ResNet with local and global pathways outperformed the ResNet with local patches only, and the classification accuracy was significantly improved by adding the BD-LSTM model, as well as the intra-loss.

Table 2. Comparison results (classification accuracy (%), reported as mean and standard deviation)

4 Conclusions

In this paper, we proposed a method combining residual networks with local and global pathways and a bi-directional long short-term memory model (ResGL-BDLSTM) to tackle the classification of focal liver lesions. The ResGLNet extracts the most representative features from each single-phase CT image, and the BD-LSTM helps to extract the enhancement patterns across the multi-phase CT images. The experimental results demonstrate that our framework outperforms other state-of-the-art methods. In future work, we plan to build a large-scale liver-lesion dataset and to construct an end-to-end framework that achieves lesion-based classification with a single model. We believe that our proposed framework can also be applied to other contrast-enhanced multi-phase CT images.