1 Introduction

Polyps are important precursors to colorectal cancer, which is the third most common cancer diagnosed in both men and women [8]. It was estimated that there would be 140,250 new colorectal cancer cases in the United States in 2018. Wireless capsule endoscope (WCE) [4] has revolutionized the diagnostic inspection of gastrointestinal (GI) polyps and remains the first-line tool for screening of abnormalities. It provides a noninvasive, direct visualization of the whole GI tract. Despite sustained technical advancements, WCE videos are still reviewed manually, which is extremely laborious and time-consuming.

Many computational methods have been developed to automatically detect polyps from WCE images [1, 7, 9, 10, 11, 12]. Gueye et al. [1] combined the scale-invariant feature transform descriptor with a bag-of-words approach for feature extraction and used a support vector machine (SVM) classifier to discriminate polyp tissue from normal images. Zhang et al. [12] proposed a transfer learning approach to automatic polyp image detection that uses Convolutional Neural Network (CNN) features learned from a non-medical domain. Segui et al. [6] fused the original RGB images with their Laplacian and Hessian transformations as the network input, and used a basic CNN architecture with three convolution-max pooling layers followed by a fully connected layer to detect polyps.

The existing methods [1, 7, 9, 10, 12] rely on low-level hand-crafted features or traditional CNN models to represent the information in WCE images. It is reported that traditional features, designed for natural images, do not perform well on medical images because of their poor generalization capability. Although deep CNNs usually extract better features than shallow networks and lead to better performance, the vanishing-gradient problem may occur and impede convergence as the network goes deeper. The recent Densely Connected Convolutional Network (DenseNet) [3] demonstrated inspiring performance in image classification. It connects each layer to every other layer in a feed-forward fashion to learn features. In this way, it ensures maximum information flow between layers, alleviates the vanishing-gradient problem, strengthens feature propagation and encourages feature reuse.

While DenseNet has shown impressive success, it still cannot effectively deal with the specific challenges of polyp detection, such as object rotation and intra-class variability. The WCE has no active motor and is propelled by peristalsis along the GI tract. Therefore, the same polyp may appear at different positions in the collected images. Moreover, polyps may be located anywhere in the GI tract, on different mucosal surfaces, and they are usually small; the collected polyp images therefore show large intra-class variance.

In this paper, we propose a novel rotation-invariant and image similarity constrained Densely Connected Convolutional Network (RIIS-DenseNet) to differentiate polyp images from normal WCE images. We first use DenseNet to build the deep learning model. We then extend DenseNet with a rotation-invariant constraint, which enforces the training samples and their rotated versions to share similar feature representations and hence achieves rotation invariance. Moreover, a novel image similarity constraint is proposed so that image features align with the direction of their category center in the learned feature space. With the joint loss function, our model learns discriminative features of WCE images and further improves the performance of polyp detection.

2 Method: RIIS-DenseNet Model

In this paper, we propose a novel RIIS-DenseNet model to differentiate polyp images from normal ones. The workflow of our method is shown in Fig. 1. Our method consists of the following four steps. First, we rotate the collected training WCE images to augment the dataset. Second, we introduce DenseNet, which connects each layer to every other layer in a feed-forward fashion, to learn features from WCE images. Third, a joint loss function comprising the softmax loss, the rotation-invariant constraint and the image similarity constraint is proposed to evaluate the loss during training. Finally, the RIIS-DenseNet model fuses these three loss terms in the DenseNet model to obtain the final discriminative WCE image features and detect polyps.

Fig. 1. Workflow of our proposed RIIS-DenseNet. It consists of two parts: data rotation augmentation and RIIS-DenseNet. The RIIS-DenseNet includes three denseblocks, two transitions, one convolution layer, one pooling layer and a novel joint loss function layer.

2.1 Data Rotation Augmentation

Given a set of initial polyp and normal WCE training samples \(X=\{X_1,X_2\}\), we generate a new set of training samples \(X_{new}=\{X,T_{\phi }X\}\), where \(T_{\phi }\) represents a group of K rotation transformations with \(\phi =\{45^{\circ },90^{\circ },135^{\circ },180^{\circ }, 225^{\circ },270^{\circ },315^{\circ }\}\), so that \(K=7\). If N denotes the original number of training images, the size of the new training set is \(N\times (K+1)\). In this way, we enlarge the training data to compensate for the limited number of WCE images. The corresponding labels of \(X_{new}\) are defined as \(L_{new}=\{l_{x_i}|x_i \in X_{new}\}\), where \(l_{x_i}\) is a one-hot vector whose only nonzero element is a 1 at the \(m^{th}\) position (\(m=1\) or \(2\)).
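The bookkeeping of this augmentation step can be sketched as follows. This is only an illustration under stated assumptions: the function name `augment_index`, the class ordering (index 0 for polyp, index 1 for normal) and the toy label list are hypothetical, and the actual pixel rotation is left out, so the code only tracks which (image, angle, label) entries the enlarged training set contains.

```python
# Rotation angles listed in the text (K = 7 transformations).
ANGLES = [45, 90, 135, 180, 225, 270, 315]


def augment_index(labels):
    """Return (sample_id, angle, one_hot_label) for originals and rotated copies.

    `labels` holds one class index per original image; 0 = polyp and
    1 = normal is an ordering assumed here for illustration only.
    """
    augmented = []
    for sample_id, cls in enumerate(labels):
        one_hot = [1 if m == cls else 0 for m in range(2)]
        # Angle 0 stands for the original image; rotated copies keep its label.
        for angle in [0] + ANGLES:
            augmented.append((sample_id, angle, one_hot))
    return augmented


toy_labels = [0, 1, 1]  # hypothetical N = 3 original images
entries = augment_index(toy_labels)
print(len(entries))  # N * (K + 1) = 3 * 8 = 24
```

Each original image contributes itself plus K rotated copies, all sharing the same one-hot label, which is where the factor \(K+1\) comes from.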

2.2 DenseNet

A deep network usually extracts better features than a shallow network and leads to better performance [2]. However, as the network goes deeper, the vanishing-gradient problem may occur and impede the convergence of the network. In this work, we utilize DenseNet [3] to alleviate the vanishing-gradient problem and strengthen feature propagation.

The principal characteristic of DenseNet is its dense connectivity. That is, each layer of the structure is directly connected to every other layer in a feed-forward way. This strategy provides direct supervision signals to all layers, enables the reuse of features among layers and strengthens feature propagation. Let \(x_{l-1}\) denote the output of the \((l-1)^{th}\) layer and \(H_l(\cdot )\) a series of transformations. The output of the \(l^{th}\) layer, \(x_l\), is then calculated as follows,

$$\begin{aligned} x_l = H_l([x_0, x_1,\ldots , x_{l-1}]), \end{aligned} \tag{1}$$

where \([\cdot ]\) refers to the concatenation operation. Suppose that each function \(H_l\) produces k feature maps; then the \(l^{th}\) layer receives \(k_0+k\times (l-1)\) input feature maps, where \(k_0\) represents the number of channels in the input layer. The parameter k defines the growth rate and controls the number of parameters in the model. DenseNet has fewer parameters than traditional networks because it avoids learning redundant features.
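The growth of the feature-map count can be checked with a small sketch that simulates the concatenation inside one dense block. The values \(k_0=16\) and \(k=12\) are hypothetical and chosen only for illustration, as are the function names; the text does not state the growth rate that was used.

```python
def input_feature_maps(layer, k0, k):
    """Closed form from the text: maps fed into layer `layer` (1-based)."""
    return k0 + k * (layer - 1)


def simulate_dense_block(num_layers, k0, k):
    """Track channel counts as each layer's k new maps are concatenated."""
    channels = k0
    inputs_per_layer = []
    for _ in range(num_layers):
        inputs_per_layer.append(channels)  # width of [x_0, ..., x_{l-1}]
        channels += k                      # H_l adds k new feature maps
    return inputs_per_layer, channels


K0, GROWTH = 16, 12  # hypothetical values, not taken from the paper
inputs, out_channels = simulate_dense_block(6, K0, GROWTH)
print(inputs)        # [16, 28, 40, 52, 64, 76]
print(out_channels)  # 88
```

The simulated widths match the closed form layer by layer, and the block output carries \(k_0 + 6k\) feature maps after the last concatenation.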

Specifically, our proposed workflow includes three denseblocks, as shown in Fig. 1. Each denseblock comprises 6 densely connected transformation layers, and each transformation consists of a batch normalization, a rectified linear unit, and a convolution layer with a \(3 \times 3\) kernel. The transition layer connects two denseblocks and consists of a \(1 \times 1\) convolution followed by a \(2 \times 2\) max pooling transformation. Following the last denseblock, we place a global max pooling layer and a fully connected layer. A joint loss function classifier is then attached to perform polyp detection.

2.3 Joint Loss Function

To achieve better characterization of WCE images and improve the polyp detection performance, our proposed RIIS-DenseNet model introduces a novel joint loss function that evaluates the loss between the predicted labels and the true ones and thereby learns discriminative features.

Traditional Softmax Function. The most widely used classification loss function is the softmax loss, which tends to minimize the misclassification errors for the given training samples. This loss function \(D(X_{new},L_{new})\) is given by

$$\begin{aligned} D(X_{new},L_{new})= \frac{1}{N\times (K+1)}\sum _{x_i \in X_{new}}l_{x_i}\log (\hat{l_{x_i}}), \end{aligned} \tag{2}$$

where \(\hat{l_{x_i}}\) indicates the predicted probability that image \(x_i\) belongs to its true class \(l_{x_i}\). Equation (2) measures the dissimilarity between the approximated output distribution and the true distribution of labels.
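A minimal numerical sketch of the softmax loss is given below. It assumes two-class logits as network outputs and negates the sum, following the usual cross-entropy convention so that a lower value means fewer misclassifications; that sign, the helper names and the toy numbers are assumptions of this sketch rather than details taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores into class probabilities (numerically stable)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(batch_logits, batch_one_hot):
    """Mean cross-entropy over a batch; negated so that lower is better."""
    total = 0.0
    for logits, one_hot in zip(batch_logits, batch_one_hot):
        probs = softmax(logits)
        total += -sum(l * math.log(p) for l, p in zip(one_hot, probs))
    return total / len(batch_logits)


# Hypothetical logits for two images whose true class is the first one.
uncertain = softmax_loss([[0.0, 0.0]], [[1, 0]])
confident = softmax_loss([[4.0, 0.0]], [[1, 0]])
print(uncertain > confident)  # True
```

An undecided prediction (equal logits) costs \(\log 2\), while a confident correct prediction costs much less, which is the behaviour the loss is meant to reward.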

Rotation-Invariance Regularization Constraint. The traditional DenseNet model uses only the softmax loss to minimize classification errors during training. This strategy ignores the rotation variance specific to polyp images, which inevitably leads to misclassification of images that belong to the same category. To obtain discriminative deeply learned features, we introduce the rotation-invariance regularization \(D(X,T_{\phi }X)\), which enforces the training samples X and their rotated versions \(T_{\phi }X\) to share similar features. This regularization term is defined as follows,

$$\begin{aligned} D(X,T_{\phi }X)=\frac{1}{N}\sum _{i=1}^{N} \Vert f(x_i)-Mean(f(T_{\phi }x_i))\Vert ^2, \end{aligned} \tag{3}$$

where \(f(x_i)\) represents the learned feature of the training sample \(x_i\), and \(Mean(f(T_{\phi }x_i))\) denotes the average feature over the K rotated versions of \(x_i\), with rotation angles \(\phi _1,\ldots ,\phi _K\). It is calculated as

$$\begin{aligned} Mean(f(T_{\phi }x_i))=\frac{1}{K} \sum _{k=1}^{K}f(T_{\phi _k}x_i). \end{aligned} \tag{4}$$

The rotation-invariance regularization constraint thus minimizes the distance between the learned feature of each sample and the average feature representation of its rotated versions. In this way, it enhances the rotation invariance of the deeply learned features.
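The regularizer can be sketched on plain feature vectors as follows. The function names and the two-dimensional toy features are hypothetical; in the model, \(f(\cdot )\) would be the learned feature output of the network.

```python
def mean_vector(vectors):
    """Element-wise average of a list of equally sized feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rotation_invariance_loss(features, rotated_features):
    """Average squared distance between f(x_i) and the mean rotated feature.

    features[i] is f(x_i); rotated_features[i] is the list of the K
    feature vectors f(T_phi x_i) of the rotated copies of x_i.
    """
    total = 0.0
    for f_x, f_rot in zip(features, rotated_features):
        mean_rot = mean_vector(f_rot)
        total += sum((a - b) ** 2 for a, b in zip(f_x, mean_rot))
    return total / len(features)


# Hypothetical features: identical rotated features give zero penalty.
invariant = rotation_invariance_loss([[1.0, 0.0]], [[[1.0, 0.0], [1.0, 0.0]]])
variant = rotation_invariance_loss([[0.0, 0.0]], [[[2.0, 0.0], [0.0, 2.0]]])
print(invariant, variant)  # 0.0 2.0
```

The penalty vanishes when the original feature coincides with the mean feature of its rotated copies and grows with the squared distance between them.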

Image Similarity Regularization Constraint. Intuitively, if two images belong to the same category, their learned features should be close to each other in the learned feature space. Therefore, the image similarity constraint is introduced to help discover more accurate features. Cosine similarity measures the angle between a feature and its intra-class center. To this end, we propose the image similarity loss constraint D(X), based on the cosine measurement, to emphasize image similarity:

$$\begin{aligned} D(X)= \frac{1}{N}\sum _{i=1}^{N}\left( 1-\frac{{f^T{(x_i)}} c_{m}}{\Vert f(x_i)\Vert \Vert c_{m}\Vert }\right) , \end{aligned} \tag{5}$$

where \(c_{m}=\frac{1}{N_{{x_i \in T_m}}}\sum _{x_i \in T_m} f(x_i)\) defines the center of the features of the \(m^{th}\) category, \(T_m\) is the set of training samples in that category and \(N_{{x_i \in T_m}}\) is their number. This loss tends to minimize the angle between each feature and its intra-class center and thus keeps the intra-class features close to each other.
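The cosine-based constraint can be illustrated with the sketch below, which computes the class centers from the given features and then averages one minus the cosine similarity. The helper names and toy vectors are hypothetical, and the sketch assumes that no feature vector or class center has zero norm.

```python
import math


def class_centers(features, labels):
    """Average the feature vectors of each category to obtain c_m."""
    groups = {}
    for feature, m in zip(features, labels):
        groups.setdefault(m, []).append(feature)
    return {m: [sum(col) / len(fs) for col in zip(*fs)]
            for m, fs in groups.items()}


def cosine(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def image_similarity_loss(features, labels):
    """Mean of 1 - cos(angle between f(x_i) and its class center)."""
    centers = class_centers(features, labels)
    penalties = [1.0 - cosine(f, centers[m]) for f, m in zip(features, labels)]
    return sum(penalties) / len(features)


# Hypothetical features: aligned classes versus one spread-out class.
aligned = image_similarity_loss([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]], [0, 0, 1])
spread = image_similarity_loss([[1.0, 0.0], [0.0, 1.0]], [0, 0])
print(aligned < spread)  # True
```

Features that point in the same direction as their class center incur no penalty regardless of their length, so the constraint acts on angles rather than on Euclidean distances.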

Final Joint Loss Function. Based on the above three observations, we adopt the joint supervision of softmax loss, rotation-invariance loss and image similarity loss to train the RIIS-DenseNet for discriminative features. The formulation of the joint loss function is given as follows,

$$\begin{aligned} D_{final}=&\frac{1}{N(K+1)}\sum _{x_i \in X_{new}}l_{x_i}\log (\hat{l_{x_i}}) +\frac{1}{N} \sum _{i=1}^{N} \Vert f(x_i)-Mean(f(T_{\phi }x_i))\Vert ^2 \\ &+\frac{1}{N}\sum _{i=1}^{N}\left( 1-\frac{{f^T{(x_i)}} c_{m}}{\Vert f(x_i)\Vert \Vert c_{m}\Vert }\right) . \end{aligned} \tag{6}$$

The softmax loss globally forces the learned feature vectors of different categories apart, while the image similarity loss pulls the deep features of the same category into alignment. Equation (6) also imposes a regularization constraint to achieve rotation invariance. With this joint supervision, the discriminative power of the deeply learned features can be greatly enhanced.

Fig. 2. (a) Loss and accuracy values for different iterations. The blue color represents the test loss while the orange color shows the training loss. The black line represents the test accuracy. (b) ROC curves for different baseline methods and ours.

3 Experiment Setup and Results

Our WCE dataset consists of 3000 WCE images, including 1500 polyp frames and 1500 normal ones. These images were extracted from 62 different WCE videos and manually annotated by expert physicians. The original size of each WCE image was \(578 \times 578 \times 3\). The input image size of the RIIS-DenseNet model was set to \(256 \times 256 \times 3\) to reduce the computational complexity. The dataset was randomly divided into three subsets for the experiments: a training set (80%), a validation set (10%) and a test set (10%). Data augmentation, including random cropping and scaling [5], was applied to enlarge the training set for the proposed RIIS-DenseNet model.
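A random 80/10/10 split of the 3000 images can be sketched as follows. The function name, the fixed seed and the choice to shuffle at the level of individual frames are assumptions of this sketch; the text only states that the division was random.

```python
import random


def split_indices(n, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle image indices and cut them into train / validation / test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seed is a hypothetical choice
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_indices(3000)
print(len(train), len(val), len(test))  # 2400 300 300
```

The three subsets are disjoint and together cover every image exactly once.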

We implemented our model using TensorFlow on a desktop with an Intel Core i7-7700 @ 3.6 GHz processor, 128 GB of RAM and an NVIDIA GeForce Titan X GPU. All training steps used mini-batches of 80 samples. The learning rate, weight decay and momentum of the RIIS-DenseNet model were set to 0.01, 0.005 and 0.9, respectively. The performance of polyp detection was evaluated by accuracy (Acc), recall (Rec), precision (Pre) and F1-score (F1). In addition, the receiver operating characteristic (ROC) curve was plotted for evaluation.
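The four evaluation criteria follow from the confusion-matrix counts, as the sketch below shows. It treats polyp as the positive class, which is an assumption, and the counts passed in are hypothetical and do not correspond to any result reported here.

```python
def detection_metrics(tp, fp, tn, fn):
    """Accuracy, recall, precision and F1-score from confusion counts.

    tp / fn: polyp images detected / missed; tn / fp: normal images
    correctly kept / wrongly flagged (polyp is the positive class here).
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"Acc": accuracy, "Rec": recall, "Pre": precision, "F1": f1}


# Hypothetical counts for a 300-image test set (150 polyp, 150 normal).
metrics = detection_metrics(tp=140, fp=6, tn=144, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

F1 is the harmonic mean of precision and recall, so it always lies between the two.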

We first analyzed the learning process of the proposed RIIS-DenseNet model. The training loss, test loss and test accuracy are shown in Fig. 2(a). The training loss and test loss both converge over the iterations, indicating that the network is successfully optimized to learn discriminative features for the polyp detection task. Moreover, the test loss consistently decreases as the training loss goes down, indicating that no serious over-fitting occurs on our WCE dataset.

Table 1. Comparison with different polyp detection methods.

We then analyzed the influence of the introduced constraints on polyp detection. The first baseline experiment was the original DenseNet model, which learned image features without the two proposed loss terms: the rotation-invariant (RI) and image similarity (IS) constraints. The second and third baseline experiments applied DenseNet with only the rotation-invariant loss or only the image similarity loss, respectively, to detect polyps in WCE images. The corresponding polyp detection criteria are recorded in Table 1. The ROC curves of our method and the three comparison methods are shown in Fig. 2(b). Our RIIS-DenseNet achieves an accuracy of 95.62%, a recall of 94.97% and a precision of 96.26%, showing significant improvements over the baseline experiments. This result confirms that the introduced regularization terms, the RI and IS constraints, play critical roles in learning discriminative features and improving detection performance.

We further assessed the performance of the proposed method by comparing it with state-of-the-art polyp diagnosis methods: the methods based on hand-crafted features [1, 7] and the deep learning based methods [6, 12]. We implemented these methods on our dataset; the average accuracy, recall and precision achieved by the existing methods and ours are shown in Table 1. The deep learning based methods perform better than the methods based on hand-crafted features, suggesting that the high-level features learned by networks are more discriminative than hand-crafted features. Compared with the existing deep learning based methods [6, 12], the proposed method improves accuracy by 7.19% and 5.70% and recall by 8.29% and 4.33%, respectively. This result confirms that the proposed RIIS-DenseNet model characterizes WCE images better than the existing methods.

4 Conclusion

We proposed a novel RIIS-DenseNet model for automatic computer-aided polyp detection in WCE videos. Our method is fundamentally different from previous works since it does not rely on hand-designed features or traditional CNN models. Instead, it utilizes a Densely Connected Convolutional Network to learn more discriminative features directly and automatically from the raw images. The image similarity constraint was introduced into the feature learning procedure to make the intra-class features similar to each other by minimizing the angle between each feature and its intra-class center in the learned feature space. In addition, a rotation-invariant constraint was proposed in our network to deal with the rotation variance of WCE images. Our method achieves a polyp detection accuracy of 95.62%. Our extensive evaluations on the annotated polyp database demonstrate that our method outperforms existing state-of-the-art methods.