Using spatial-temporal ensembles of convolutional neural networks for lumen segmentation in ureteroscopy

Purpose: Ureteroscopy is an efficient endoscopic minimally invasive technique for the diagnosis and treatment of upper tract urothelial carcinoma (UTUC). During ureteroscopy, the automatic segmentation of the hollow lumen is of primary importance, since it indicates the path that the endoscope should follow. In order to obtain an accurate segmentation of the hollow lumen, this paper presents an automatic method based on Convolutional Neural Networks (CNNs). Methods: The proposed method is based on an ensemble of 4 parallel CNNs to simultaneously process single- and multi-frame information. Of these, two architectures are taken as core models, namely a U-Net based on residual blocks ($m_1$) and Mask-RCNN ($m_2$), which are fed with single still frames $I(t)$. The other two models ($M_1$, $M_2$) are modifications of the former ones, in which a stage based on 3D Convolutions is added to process temporal information. $M_1$ and $M_2$ are fed with triplets of frames ($I(t-1)$, $I(t)$, $I(t+1)$) to produce the segmentation for $I(t)$. Results: The proposed method was evaluated on a custom dataset of 11 videos (2,673 frames) collected and manually annotated from 6 patients. We obtain a Dice similarity coefficient of 0.80, outperforming previous state-of-the-art methods. Conclusion: The obtained results show that spatial-temporal information can be effectively exploited by the ensemble model to improve hollow lumen segmentation in ureteroscopic images. The method is effective also in the presence of poor visibility, occasional bleeding, or specular reflections.


Introduction
Upper Tract Urothelial Cancer (UTUC) is a sub-type of urothelial cancer which arises in the renal pelvis and the ureter. The disease affected an estimated 3,970 patients in the United States in 2020 [1]. Flexible Ureteroscopy (URS) is nowadays the gold standard for UTUC diagnosis and conservative treatment. URS is used to inspect the tissue in the urinary system and to determine the presence and size of tumours [2], as well as for biopsy of suspicious lesions [3]. The procedure is carried out under the visual guidance of an endoscopic camera [4].
Navigation and diagnosis through the urinary tract are highly dependent upon the operator expertise [5]. For this reason, the current development of methods in Computer Assisted Interventions (CAI) intends to support surgeons by providing them with relevant information during the procedure [6]. Additionally, within the endeavours of developing new tools for robotic ureteroscopy, a navigation system which relies on image information from the endoscopic camera is also needed [7].
In this study, we focus on the segmentation of the ureter's lumen. In ureteroscopic images, the lumen typically appears as a tunnel or hole, with its center being the region with the lowest illuminance inside the Field of View (FOV). Lumen segmentation presents some particular challenges, such as the difficulty of defining its exact boundary, the narrowing of the ureter around the ureteropelvic junction [4], and the appearance of image artifacts such as blur and occlusions due to floating debris or bleeding. Some examples of these, present in our data, are shown in Fig. 1.
In the CAI domain, Deep Learning (DL)-based methods represent the state-of-the-art for many image processing tasks, including segmentation. In [8] an 8-layer Fully Convolutional Network (FCN) is presented for semantic segmentation of colonoscopy images into different classes, including the lumen of the colon, polyps and tools. In [9] a U-Net-like architecture based on residual blocks is proposed for lumen segmentation in ureteroscopy images. However, these DL-based approaches in the field of CAI only use single frames, which discards the extra information carried by temporal features.
The exploitation of spatial-temporal information has been shown to achieve better performance than approaches that only process single frames. In [10] a model based on 3D convolutions is proposed for the task of tool detection and articulation estimation, and in [11] a method for infants' limb-pose estimation in intensive care uses 3D Convolutions to encode the connectivity in the temporal direction.
Additionally, recent results in different biomedical image segmentation challenges have shown the effectiveness of DL ensemble models, such as in [12], where an ensemble consisting of 4 U-Net-like models and one Deeplabv3+ network obtained 2nd place in the 2019 SIIM-ACR pneumothorax challenge, and in [13], where an ensemble which analyzed single-slice and 3D volumetric data separately obtained top performance on the HVSMR 3D Cardiovascular MRI in Congenital Heart Disease 2016 challenge dataset.
Inspired by both paradigms, our research hypothesis is that an ensemble which uses both single-frame and consecutive-frame information can achieve better generalization than models which use only one of them. For this purpose, we propose an ensemble model which uses 4 Convolutional Neural Networks in parallel to exploit the information contained in single frames and in consecutive frames of ureteroscopy videos.

Proposed Method
As introduced in [14,12], we considered the use of ensembles to reach a better generalization of the model when testing it on unseen data. The proposed ensemble of CNNs for ureter's lumen segmentation is depicted in Fig. 2. Our ensemble is fed with three consecutive frames [$I(t-1)$, $I(t)$, $I(t+1)$] and produces the segmentation for the frame $I(t)$. The ensemble is made of two pairs of branches. One pair (the red one in Fig. 2) consists of the U-Net with residual blocks ($m_1$) and Mask-RCNN ($m_2$), which process the central frame $I(t)$. The other pair (the orange path in Fig. 2) processes the three frames with $M_1$ and $M_2$, which extend $m_1$ and $m_2$ as explained in Sec. 2.1.
It is important to notice that the frames constituting the input of any $M$ are expected to show minimal but still significant changes, providing extra information which could not be obtained by other means. Some specific examples in our case study include debris crossing the FOV rapidly, the sudden appearance or disappearance of an image specularity, and slight changes in the illumination or in the position of the element to be segmented. For this reason, we consider only three consecutive frames $I(t-1)$, $I(t)$, $I(t+1)$ as input for the model.
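As an illustration of how the triplet inputs described above can be built from a frame sequence, the following sketch groups a video into overlapping windows (I(t-1), I(t), I(t+1)); the function name and array layout are assumptions, not part of the original implementation:

```python
import numpy as np

def make_triplets(frames):
    """Group a frame sequence into overlapping triplets
    (I(t-1), I(t), I(t+1)), one per interior frame.

    frames: array of shape (n, p, q, nc)
    returns: array of shape (n - 2, 3, p, q, nc)
    """
    n = len(frames)
    return np.stack([frames[t - 1:t + 2] for t in range(1, n - 1)])

# Example: 5 dummy grayscale frames of size 4x4
video = np.arange(5 * 4 * 4).reshape(5, 4, 4, 1).astype(np.float32)
triplets = make_triplets(video)
# One triplet per interior frame t = 1, 2, 3: shape (3, 3, 4, 4, 1)
```

Each triplet is then fed to $M_1$ and $M_2$ to segment its central frame.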
The core models $m_1$, $m_2$ on which our method is based are two state-of-the-art architectures for instance segmentation: 1. ($m_1$): The U-Net implementation used in this work is based on residual units as in [9], instead of the classical convolutional blocks; this is meant to address the degradation problem, as proposed in [15].
2. ($m_2$): An implementation of Mask-RCNN [16] using ResNet50 as backbone. Mask-RCNN is composed of different stages. The first stage comprises two networks: a "backbone", which performs the initial feature extraction of the input using a pretrained network, and a region proposal network. The second stage of the model consists of different modules, which include a network that predicts the bounding boxes, an object classification network, and a FCN which generates the masks for each RoI.
Since our implementation is made of different sets of models, the final output is determined using an ensemble function $F(p_i(t))$, defined as the pixel-wise average of the individual predictions:
$$F(p_i(t)) = \frac{1}{k}\sum_{i=1}^{k} p_i(t), \qquad (1)$$
where $p_i(t)$ corresponds to the prediction of each of the $k = 4$ models for a frame $I(t)$.
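A minimal sketch of such an ensemble function, assuming the four per-model probability maps are averaged pixel-wise and then thresholded at 0.5 (the threshold value is an assumption of this sketch):

```python
import numpy as np

def ensemble(preds, threshold=0.5):
    """Combine per-model probability maps into one binary mask.

    preds: array of shape (k, p, q) with values in [0, 1],
           one soft prediction per model (here k = 4).
    The maps are averaged pixel-wise and thresholded.
    """
    mean = np.mean(preds, axis=0)
    return (mean >= threshold).astype(np.uint8)

# Four toy 2x2 predictions, one per model
p = np.array([[[0.9, 0.1], [0.6, 0.2]],
              [[0.8, 0.2], [0.7, 0.1]],
              [[0.7, 0.4], [0.2, 0.3]],
              [[0.9, 0.3], [0.9, 0.0]]])
mask = ensemble(p)
# mean = [[0.825, 0.25], [0.6, 0.15]] -> mask = [[1, 0], [1, 0]]
```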

Extending the core models for handling multi-frame information
For each core model m, an extension M is obtained by adapting the architecture for processing multi-frame information.
Let $\mathcal{I}$ be an ordered set of $n$ elements $I \in \mathbb{N}^{p \times q \times n_c}$ corresponding to frames of a video, where $p$ and $q$ represent the spatial dimensions and $n_c$ the number of color channels (Fig. 3). Starting from any core model ($m$), which takes as input elements from $\mathcal{I}$, we can define another segmentation model ($M$) which receives multi-frame information from $\mathcal{I}$. Specifically, it receives inputs of the form $I \in \mathbb{N}^{r \times p \times q \times n_c}$, where $r = 3$ represents the temporal dimension (number of frames). To this aim, the core model $m$ is extended by prepending an additional 3D Convolution layer with $n_k$ kernels of size $r \times 3 \times 3$. The new layer produces an output $H \in \mathbb{N}^{1 \times (p-2) \times (q-2) \times n_k}$, so that feeding it into $m$ is straightforward. The issue of having $p-2$ and $q-2$ instead of $p$ and $q$ after the 3D Convolution is fixed by padding the output with zeros in the two spatial dimensions. A graphical representation of the process is shown in Fig. 3.
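The shape arithmetic above can be sketched with a naive "valid" 3D convolution followed by zero-padding. This is a numpy illustration of the prepended layer only (random weights; in the actual models the kernels are learned, and the layer would be a Keras `Conv3D`):

```python
import numpy as np

def prepend_3d_conv(stack, kernels):
    """Sketch of the prepended layer: a 'valid' 3D convolution over
    (r, p, q) followed by zero-padding of the two spatial dimensions.

    stack:   (r, p, q, nc) triplet of frames (r = 3)
    kernels: (n_k, r, 3, 3, nc) filter bank
    returns: (p, q, n_k) tensor, shaped as a single-frame input for m
    """
    r, p, q, nc = stack.shape
    n_k = kernels.shape[0]
    out = np.zeros((p - 2, q - 2, n_k))
    for k in range(n_k):
        for i in range(p - 2):
            for j in range(q - 2):
                # full temporal extent, 3x3 spatial window, all channels
                out[i, j, k] = np.sum(stack[:, i:i + 3, j:j + 3, :] *
                                      kernels[k])
    # pad (p-2, q-2) back to (p, q) with zeros, as described in the text
    return np.pad(out, ((1, 1), (1, 1), (0, 0)))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8, 3))     # triplet of 8x8 RGB frames
w = rng.standard_normal((4, 3, 3, 3, 3))  # n_k = 4 kernels
h = prepend_3d_conv(x, w)                 # shape (8, 8, 4)
```

The output $H$ then enters the unchanged core model $m$ as if it were a single frame with $n_k$ channels.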

Dataset
For this study, 11 videos from 6 patients undergoing ureteroscopy procedures were collected. Videos from five patients were used for training the model and tuning hyperparameters. Videos from the remaining patient, randomly chosen, were kept aside and only used for evaluating the performance. The videos were acquired at the European Institute of Oncology (IEO) in Milan, Italy, following the ethical protocol approved by the IEO and in accordance with the Helsinki Declaration. The number of frames extracted and manually segmented per video is shown in Table 1. Data augmentation was applied before training. The operations used for this purpose were rotations in intervals of 90°, horizontal and vertical flipping, and zooming in and out in a range of ±2% of the size of the original image.
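The augmentation operations listed above can be sketched as follows; the zoom is implemented here as a center crop or zero-pad, which is an assumption of this sketch (interpolation-based resizing would be an equally valid reading):

```python
import numpy as np

def augment(img, angle=0, flip_h=False, flip_v=False, zoom=0.0):
    """Apply the augmentations described in the text to one frame:
    rotations in multiples of 90 degrees, horizontal/vertical flips,
    and zooming in/out by a small fraction (about +/- 2%).
    img: (p, q, nc) array; positive zoom crops, negative zoom pads.
    """
    out = np.rot90(img, k=angle // 90)
    if flip_h:
        out = out[:, ::-1]
    if flip_v:
        out = out[::-1, :]
    if zoom:
        p, q = out.shape[:2]
        dp = int(round(p * abs(zoom)))
        dq = int(round(q * abs(zoom)))
        if zoom > 0 and dp and dq:       # zoom in: crop the border
            out = out[dp:p - dp, dq:q - dq]
        elif zoom < 0 and dp and dq:     # zoom out: pad with zeros
            out = np.pad(out, ((dp, dp), (dq, dq), (0, 0)))
    return out

frame = np.ones((100, 100, 3))
a = augment(frame, angle=180, flip_h=True, zoom=0.02)  # -> (96, 96, 3)
```

Note that the same transform must be applied to the frame and to its ground-truth mask so that they stay aligned.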

Training Setting
All the models were trained, one at a time, by minimizing a loss function based on the Dice Similarity Coefficient ($L_{DSC}$), defined as:
$$L_{DSC} = 1 - \frac{2\,TP}{2\,TP + FP + FN},$$
where $TP$ (True Positives) is the number of pixels belonging to the lumen which are correctly segmented, $FP$ (False Positives) is the number of pixels mis-classified as lumen, and $FN$ (False Negatives) is the number of pixels which belong to the lumen but are not classified as such.
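A soft (differentiable) version of this loss, written in terms of the overlap between prediction and ground truth, can be sketched as follows; the smoothing term `eps` is an implementation detail assumed here to avoid division by zero:

```python
import numpy as np

def dice_loss(y_true, y_pred, eps=1e-7):
    """Soft Dice loss: L_DSC = 1 - 2*TP / (2*TP + FP + FN).

    y_true, y_pred: arrays of the same shape with values in [0, 1].
    With soft predictions, TP/FP/FN become sums of products rather
    than pixel counts, which keeps the loss differentiable.
    """
    tp = np.sum(y_true * y_pred)
    fp = np.sum((1 - y_true) * y_pred)
    fn = np.sum(y_true * (1 - y_pred))
    return 1.0 - (2 * tp) / (2 * tp + fp + fn + eps)

gt = np.array([[1, 1], [0, 0]], dtype=float)
pred = np.array([[1, 0], [1, 0]], dtype=float)
# TP = 1, FP = 1, FN = 1 -> L_DSC = 1 - 2/4 = 0.5
loss = dice_loss(gt, pred)
```

In the actual training code the same expression would be written with TensorFlow ops so that gradients can flow through it.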
For $m_1$, the hyperparameters learning rate ($lr$) and mini-batch size ($bs$) were determined by a grid search using a 5-fold cross-validation strategy with the data from patients 1, 2, 3, 4 and 6. The ranges in which this search was performed were $lr = \{10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}\}$ and $bs = \{4, 8, 16\}$. The DSC was set as the evaluation metric to determine the best model in each of the experiments. Concerning the extensions $M$, the same strategy was used to determine the number of kernels of the input 3D convolutional layer; the remaining hyperparameters were set as for $m_1$.
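The grid search with 5-fold cross-validation described above can be sketched generically; `train_eval` is a hypothetical placeholder for training on four folds and returning the validation DSC on the held-out fold:

```python
from itertools import product

def grid_search(train_eval, lrs, batch_sizes, n_folds=5):
    """Score every (lr, bs) pair by 5-fold cross-validation and keep
    the pair with the best mean DSC.

    train_eval(lr, bs, fold) -> validation DSC on the held-out fold.
    """
    best, best_score = None, -1.0
    for lr, bs in product(lrs, batch_sizes):
        score = sum(train_eval(lr, bs, f) for f in range(n_folds)) / n_folds
        if score > best_score:
            best, best_score = (lr, bs), score
    return best, best_score

# Toy stand-in for training that pretends lr=1e-4, bs=8 is optimal
fake = lambda lr, bs, fold: 1.0 - abs(lr - 1e-4) * 100 - abs(bs - 8) * 0.01
cfg, dsc = grid_search(fake, [1e-3, 1e-4, 1e-5, 1e-6], [4, 8, 16])
# cfg == (1e-4, 8)
```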
In the case of $m_2$, the same 5-fold cross-validation strategy was used. The hyperparameters tuned were the backbone (choosing between ResNet50 and ResNet101 [15]) and the minimal detection confidence, in a range of 0.5 to 0.9 in steps of 0.1. To cover the range of mask sizes in the training and validation datasets, the anchor scales were set to 32, 64, 128 and 160. In this case, the number of filters in the initial 3D convolutional layer was set to 3, which is the only value that, after reshaping, matches the predefined input size of the ResNet backbone.
For each core model and its respective extension, once the hyperparameter values were chosen, an additional training process was carried out using these values in order to obtain the final model. The training was performed using all the annotated frames obtained from the previously mentioned 5 patients; 60% of the frames were used for training and 40% for validation. The predictions obtained in this step were the ones used to compute the ensemble result with the function defined in Eq. 1.
The networks were implemented using the TensorFlow and Keras frameworks in Python 3.6 and trained on an NVIDIA GeForce RTX 2080 GPU.

Performance Metrics
The performance metrics chosen were DSC, Precision ($Prec$) and Recall ($Rec$), defined as:
$$DSC = \frac{2\,TP}{2\,TP + FP + FN}, \qquad Prec = \frac{TP}{TP + FP}, \qquad Rec = \frac{TP}{TP + FN}.$$
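As a minimal sketch, these three metrics can be computed from the pixel counts of two binary masks as follows (the function name is an assumption; the formulas are the standard ones given above):

```python
import numpy as np

def metrics(y_true, y_pred):
    """DSC, precision and recall from pixel counts of binary masks."""
    tp = np.sum((y_true == 1) & (y_pred == 1))  # correctly segmented lumen
    fp = np.sum((y_true == 0) & (y_pred == 1))  # mis-classified as lumen
    fn = np.sum((y_true == 1) & (y_pred == 0))  # missed lumen pixels
    dsc = 2 * tp / (2 * tp + fp + fn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return dsc, prec, rec

gt = np.array([[1, 1, 0], [1, 0, 0]])
pr = np.array([[1, 0, 0], [1, 1, 0]])
d, p, r = metrics(gt, pr)
# TP = 2, FP = 1, FN = 1 -> DSC = 2/3, Prec = 2/3, Rec = 2/3
```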

Ablation study and comparison with state-of-the-art
First, the performance of the proposed method was compared with the one presented in [9], where the same U-Net architecture based on residual blocks was used. Then, as an ablation study, four versions of the ensemble model were tested; in each case, the ensemble function was computed using the predictions of the models included in that version. The Kruskal-Wallis test on the DSC was used to assess the statistical significance of the differences between the single models tested.

Results
The box plots of $Prec$, $Rec$ and DSC are shown in Fig. 4. Results of the ablation study are shown in Table 2. The proposed method achieved a DSC of 0.80, which is 8% better than $m_1$ using single frames ($p < 0.01$) and 3% better than $m_2$ trained as well with single frames ($p < 0.05$). When using single-frame information, $m_2$ performs 5% better than $m_1$; however, the result is the opposite when using multi-frame information. The ensemble of single-frame models ($m_1$, $m_2$) performs 7% better than the ensemble of models exploiting multi-frame information ($M_1$, $M_2$). In the case of the spatio-temporal models, the U-Net based on residual blocks ($M_1$) performs 3% better than the one based on Mask-RCNN ($M_2$).

Discussion
The proposed method achieved satisfactory results, outperforming existing approaches for lumen segmentation [9]. Quantitative evaluation, together with a visual inspection of the obtained segmentations, highlights the advantage of using ensembles, confirming our research hypothesis. This is particularly appreciable in the presence of occlusions such as blood or dust covering the FOV (Fig. 5, rows 4-7). In those cases, single-frame-based models tended to include non-lumen regions in the predicted segmentation. The opposite behavior was observed when using only multi-frame-based models, which tended to predict smaller regions with respect to the ground truth; this is also noticeable in the overall performances obtained in the ablation study (Table 2). The ensemble of all the models resulted, instead, in a predicted mask closer to the ground truth, which explains why its use generally leads to better performance.

It was also observed that the proposed ensemble method was able to correctly suppress undesirable false positives produced by single models. This is due to the fact that those false positives did not appear in all the models at the same regions; therefore, the ensemble eliminates them from the final result. This is of great importance in clinical practice, given that false positive classifications during endoluminal inspection might result in a range of surgical complications, including tools colliding with tissues [17] and incorrect path planning [18], among others.

Despite the positive results achieved by the proposed approach, some limitations are worth mentioning. One of them is the computational time required for inference: the proposed model requires 4 times more inference time than previous implementations. However, it is important to state that, in applications of minimally invasive surgery, accuracy may be preferred over speed to avoid complications such as perforations of the ureter [5].
Furthermore, such time could be reduced by taking advantage of distributed parallel set-ups. A final issue is related to the scarcity of publicly available annotated data, necessary for training and benchmarking, which is a well-known problem in the literature. However, this can be overcome in the future as new public repositories containing spatial-temporal data are released.

Conclusion
In this paper, we introduced a novel ensemble method for ureter's lumen segmentation. Two core models based on U-Net and Mask-RCNN were exploited and extended in order to capture both single-frame and multi-frame information. Experiments showed that the proposed ensemble method outperforms previous approaches for the same task [9], achieving an increment of 7% in terms of DSC. In the future, natural extensions of the present work will be investigated, including better methods to fit spatial-temporal data into models which were pre-trained on single-image datasets (such as Mask-RCNN). Furthermore, we will investigate methods for decreasing the inference time, thus allowing real-time applications.