Dataset
For this study, 11 videos from 6 patients undergoing ureteroscopy procedures were collected. Videos from five patients were used for training the model and tuning the hyperparameters. Videos from the remaining patient, chosen at random, were held out and used only for evaluating performance. The videos were acquired at the European Institute of Oncology (IEO) in Milan, Italy, following the ethical protocol approved by the IEO and in accordance with the Helsinki Declaration.
The number of frames extracted and manually segmented per video is shown in Table 1. Data augmentation was applied before training. The operations used for this purpose were rotations in intervals of 90\(^{\circ }\), horizontal and vertical flipping, and zooming in and out in a range of ±2\(\%\) of the original image size.
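The augmentation operations listed above can be sketched as follows. This is a minimal numpy illustration, not the authors' actual pipeline; the helper names and the nearest-neighbour zoom are assumptions made for the example.

```python
import numpy as np

def _zoom_nn(image, factor):
    """Nearest-neighbour zoom about the image centre; the output keeps the
    input shape (edge pixels are replicated when zooming out)."""
    h, w = image.shape[:2]
    ys = np.clip(((np.arange(h) - h / 2) / factor + h / 2).round().astype(int), 0, h - 1)
    xs = np.clip(((np.arange(w) - w / 2) / factor + w / 2).round().astype(int), 0, w - 1)
    return image[np.ix_(ys, xs)]

def augment(image, rng):
    """Apply one random combination of the operations described above:
    a rotation by a multiple of 90 degrees, horizontal/vertical flips,
    and a zoom of up to +/-2%."""
    image = np.rot90(image, k=int(rng.integers(0, 4)))   # 0, 90, 180 or 270 degrees
    if rng.random() < 0.5:
        image = image[:, ::-1]                           # horizontal flip
    if rng.random() < 0.5:
        image = image[::-1, :]                           # vertical flip
    return _zoom_nn(image, 1.0 + rng.uniform(-0.02, 0.02))
```

The same transform must be applied jointly to each frame and its segmentation mask so that they stay aligned.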
Training setting
All the models were trained, one at a time, by minimizing a loss function based on the Dice similarity coefficient (\(L_\mathrm{DSC}\)), defined as:
$$\begin{aligned} L_\mathrm{DSC} = 1- \frac{2 {\mathrm{TP}}}{2\mathrm{TP} + \mathrm{FN} + \mathrm{FP}} \end{aligned}$$
(2)
where true positive (TP) is the number of pixels belonging to the lumen that are correctly segmented, false positive (FP) is the number of pixels misclassified as lumen, and false negative (FN) is the number of pixels that belong to the lumen but are classified as background.
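In terms of binary masks, Eq. 2 can be computed as below. This is an evaluation-style sketch on hard {0, 1} masks; during training the same formula is applied to the network's soft probabilities so the loss stays differentiable, and the small `eps` guarding against empty masks is an assumption, not stated in the paper.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    """L_DSC = 1 - 2*TP / (2*TP + FN + FP) for binary masks of equal shape."""
    tp = np.sum((pred == 1) & (target == 1))  # lumen pixels correctly segmented
    fp = np.sum((pred == 1) & (target == 0))  # pixels misclassified as lumen
    fn = np.sum((pred == 0) & (target == 1))  # lumen pixels classified as background
    return 1.0 - (2.0 * tp) / (2.0 * tp + fn + fp + eps)
```

A perfect prediction gives a loss near 0, while a prediction missing the entire lumen gives a loss of 1.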
For \(m_1\), the learning rate (lr) and mini-batch size (bs) were determined by a grid search with a fivefold cross-validation strategy over the data from patients 1, 2, 3, 4 and 6. The search ranges were \(lr = \lbrace 1e{-}3, 1e{-}4, 1e{-}5, 1e{-}6 \rbrace \) and \(bs = \lbrace 4,8,16 \rbrace \). The DSC was used as the evaluation metric to select the best model in each experiment. For the extensions \(M_1\) and \(M_2\), the same strategy was used to determine the number of kernels of the input 3D convolutional layer; the remaining hyperparameters were set to the same values as for \(m_1\).
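The grid search with patient-wise fivefold cross-validation can be sketched as follows. The `train_and_score` callable, which would train \(m_1\) on the given patients and return the validation DSC, is a placeholder for the paper's training routine, not actual code from the study.

```python
import itertools
import numpy as np

def grid_search_cv(patients, lrs, batch_sizes, train_and_score, n_folds=5):
    """Select (lr, bs) maximizing mean validation DSC over patient-wise folds.

    `patients` lists the training-patient IDs (one held out per fold);
    `train_and_score(lr, bs, train_ids, val_ids)` returns a validation DSC.
    """
    best_params, best_dsc = None, -1.0
    for lr, bs in itertools.product(lrs, batch_sizes):
        scores = []
        for fold in range(n_folds):
            val_ids = [patients[fold % len(patients)]]        # hold out one patient
            train_ids = [p for p in patients if p not in val_ids]
            scores.append(train_and_score(lr, bs, train_ids, val_ids))
        mean_dsc = float(np.mean(scores))
        if mean_dsc > best_dsc:
            best_params, best_dsc = (lr, bs), mean_dsc
    return best_params, best_dsc
```

With five training patients, each fold holds out exactly one patient, keeping frames from the same patient out of both splits.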
In the case of \(m_2\), the same fivefold cross-validation strategy was used. The hyperparameters tuned were the backbone (ResNet50 or ResNet101 [15]) and the minimal detection confidence, in the range 0.5–0.9 with steps of 0.1. To cover the range of mask sizes in the training and validation datasets, the anchor scales were set to 32, 64, 128 and 160. In this case, the number of filters in the initial 3D convolutional layer was set to 3, the only value compatible, after reshaping, with the predefined input size of the ResNet backbone.
For each core model and its respective extension, once the hyperparameter values were chosen, an additional training process was carried out using these values to obtain the final model. The training was performed using all the annotated frames from the previously mentioned 5 patients, with 60\(\%\) of the frames used for training and 40\(\%\) for validation. The results obtained in this step were used to calculate the ensemble results with the function defined in Eq. 1.
The networks were implemented in Python 3.6 using the TensorFlow and Keras frameworks and trained on an NVIDIA GeForce RTX 2080 GPU.
Performance metrics
The performance metrics chosen were DSC, precision (Prec) and recall (Rec), defined as:
$$\begin{aligned} \mathrm{DSC} = 1 - L_\mathrm{DSC} \end{aligned}$$
(3)
$$\begin{aligned} \mathrm{Prec} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} \end{aligned}$$
(4)
$$\begin{aligned} \mathrm{Rec} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \end{aligned}$$
(5)
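Eqs. 3–5 reduce to simple counts over the binary masks; a compact sketch (the zero-division conventions for empty masks are an assumption):

```python
import numpy as np

def segmentation_metrics(pred, target):
    """Return (DSC, precision, recall) for binary masks, per Eqs. 3-5."""
    tp = float(np.sum((pred == 1) & (target == 1)))
    fp = float(np.sum((pred == 1) & (target == 0)))
    fn = float(np.sum((pred == 0) & (target == 1)))
    dsc = 2 * tp / (2 * tp + fn + fp) if (tp + fn + fp) else 1.0
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    return dsc, prec, rec
```

Precision penalizes over-segmentation (FP), recall penalizes under-segmentation (FN), and the DSC balances both.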
Ablation study and comparison with state of the art
First, the performance of the proposed method was compared with that of the method presented in [9], where the same residual-block-based U-Net architecture was used. Then, as an ablation study, four versions of the ensemble model were tested:
1. (\(m_1\),\(m_2\)): only single-frame information was considered in the ensemble;
2. (\(M_1\),\(M_2\)): only multi-frame information was considered in the ensemble;
3. (\(m_1\),\(M_1\)) and (\(m_2\),\(M_2\)): each core model and its respective extension were considered in the ensemble, separately.
In these cases, the ensemble function was computed using the prediction values of each of the models. The Kruskal–Wallis test on the DSC was used to assess the statistical significance of the differences between the single models tested.
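The significance test can be run directly on the per-frame DSC scores of each model with `scipy.stats.kruskal`. A short sketch, assuming the scores are collected per model and that a conventional significance level of 0.05 is used (the paper does not state the level):

```python
from scipy.stats import kruskal

def compare_models(dsc_scores, alpha=0.05):
    """Kruskal-Wallis H-test over per-frame DSC scores of several models.

    `dsc_scores` maps a model name to its list of per-frame DSC values.
    Returns the H statistic, the p-value, and whether the difference is
    significant at level `alpha` (alpha=0.05 is an assumption).
    """
    stat, p = kruskal(*dsc_scores.values())
    return stat, p, p < alpha
```

The Kruskal–Wallis test is non-parametric, which suits DSC scores, as they are bounded in [0, 1] and generally not normally distributed.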