Point detection through multi-instance deep heatmap regression for sutures in endoscopy

Purpose: Mitral valve repair is a complex minimally invasive surgery of the heart valve. In this context, suture detection from endoscopic images is a highly relevant task that provides quantitative information to analyse suturing patterns, assess prosthetic configurations and produce augmented reality visualisations. Facial or anatomical landmark detection tasks typically contain a fixed number of landmarks, and use regression or fixed heatmap-based approaches to localize the landmarks. However in endoscopy, there are a varying number of sutures in every image, and the sutures may occur at any location in the annulus, as they are not semantically unique. Method: In this work, we formulate the suture detection task as a multi-instance deep heatmap regression problem, to identify entry and exit points of sutures. We extend our previous work, and introduce the novel use of a 2D Gaussian layer followed by a differentiable 2D spatial Soft-Argmax layer to function as a local non-maximum suppression. Results: We present extensive experiments with multiple heatmap distribution functions and two variants of the proposed model. In the intra-operative domain, Variant 1 showed a mean F1 of +0.0422 over the baseline. Similarly, in the simulator domain, Variant 1 showed a mean F1 of +0.0865 over the baseline. Conclusion: The proposed model shows an improvement over the baseline in the intra-operative and the simulator domains. The data is made publicly available within the scope of the MICCAI AdaptOR2021 Challenge https://adaptor2021.github.io/, and the code at https://github.com/Cardio-AI/suture-detection-pytorch/. DOI:10.1007/s11548-021-02523-w. The link to the open access article can be found here: https://link.springer.com/article/10.1007%2Fs11548-021-02523-w


Introduction
Mitral valve repair is a surgery of the mitral valve of the heart, that seeks to restore its function by reconstructing the valvular tissue. In this surgery, a prosthetic ring is affixed to the mitral valve, by first suturing around the annulus of the valve and then implanting the ring of a chosen size, through the sutures onto the annulus [3]. Mitral valve repair is increasingly performed in a minimally invasive manner [4], with a reliance on image guidance, in particular endoscopic video for the reconstruction process.
Besides, the use of surgical simulators are becoming more popular in training and familiarizing the surgeons with the demanding surgical techniques. In our previous work, we showed how to simulate endoscopic surgeries on a patient-individual basis with the help of flexible 3D-printed mitral valve replica [11,10]. The endoscopic data stream obtained during surgery or such simulations can be analysed in real-time or retrospectively to extract quantitative information with regards to patient-valve geometry [21] or context-aware visualisations.
In particular, suture detection is one such task that can provide quantitative information. This information can then be used to analyse the suturing patterns, examine the correlation with different levels of expertise, understand the optimal suture configuration in the context of ring implantation, and to use the suture locations to create augmented reality visualisations [8]. The task of suture detection entails detecting the entry and exit point of the sutures around the annulus. More precisely, given an image and the corresponding suture locations for this image, the task is to predict the suture locations for unseen endoscopic images. There have been multiple approaches from the literature in the field of landmark detection, more commonly in facial landmark detection [26], pose estimation [2,16] and medical landmark detection [28,23] to tackle this problem. However, there are two important distinctions due to which these approaches are not directly applicable to our task. Firstly, for any given image there exists a variable number of sutures, unlike a fixed number of facial or anatomical landmarks. Secondly, the points have a semantic meaning and are of single-instance. This renders fixed regression-based approaches ill-suited to our task. Additionally, commonly used patch-based refinement of anatomical landmarks are inapt in this scenario.
In this approach, we seek to solve a multi-instance detection problem for 2D points. The approach is based on a heatmap which typically models the distribution of likelihood around the point as a Gaussian. In this paper, we extend on our previous conference work [24], where we proposed a simple U-net with a single channel output. The output map was thresholded and the center of mass was calculated for each region to determine the final position of points.
In the work at hand, we introduce the usage of a differentiable Gaussian filter and a Soft-Argmax layer to enforce both, learning of the heatmap and further extracting the points from the heatmap, in a differentiable manner. We demonstrate an improvement compared to our previous baseline, and additionally perform experiments comparing various final layer configurations and loss functions. The approach is evaluated on two different domains, i.e. video data from simulated surgery and from real procedures are used. The data is made publicly available within the scope of the MICCAI AdaptOR2021 Challenge https: //adaptor2021.github.io/. To be consistent with the challenge, we used the same data set split like in the challenge, which is slightly different from [24].

Regression-based approaches
Discriminative approaches to landmark detection comprise of regression or heatmap based methods. Regression based methods directly estimate the landmark coordinates from the image. Duffner and Garcia [6] is one of the early works using neuronal layers to estimate facial feature positions. Subsequently, due to the exponential growth of deep learning tools and techniques, there have been a number of works that estimate this mapping with a neural network [26]. The regression based methods model a mapping from the image space to the coordinate space, which is highly non-linear and therefore difficult to learn. The approaches from the literature tackle this problem by using a CNN cascade or progressive refinement of the landmarks [13,27].
In the field of medical imaging, landmark localization is a relevant step for tasks such as registration and augmented reality visualisations. Cascaded and stage-wise models [25,28] are typically used for anatomical landmark regression. Sofka et al. [23] presented a landmark regressor for ultrasound image sequences, that additionally imposes a temporal constraint with LSTM cells along with a center-of-mass layer to extract landmark locations. However, all of the aforementioned methods predict a pre-defined number of landmarks, which allows to pre-determine the shape of the tensor to be regressed in the output layer of a network. Additionally, learning a transformation from the image space to the coordinate space is a highly non-linear mapping which is further complicated by the variations in camera view, pose, illumination and scene composition.

Heatmap-based approaches
Unlike regression based approaches that directly regress on the coordinates, heatmaps model a distribution of likelihood around the points of interest. In the recent years, the field of landmark detection is moving towards the use of heatmap based approaches [26], as they model a mapping from image-to-image space unlike regression models. A Gaussian distribution is commonly used to model the likelihood of the landmark locations. Typically, the heatmap approaches represent a single landmark in one channel, making it easy to perform differentiable operations or post-processing [5]. Bulat and Tzimiropoulos [2] proposed a two-stage network to regress on heatmaps and further finetune the landmarks in subsequent stages. The Deep Alignment Network (DAN) [17] processes the whole image in contrast to patches and finetunes the landmark estimates using heatmaps. In the medical domain, Zhang et al. [29] used a multi-task network to learn displacement maps using heatmaps. Payer et al. [18] proposed a two stage heatmap based network. All in all, similar to the regression approaches, the heatmap based regression networks, model one heatmap per channel with a pre-defined set of landmarks.
In earlier works on a small intra-operative dataset [8,9], we have used random forests and tailored post-processing for point detection and optical flow for point tracking. Our previous work on the same data base [24] formulates the landmark detection task as a deep learning based approach, and demonstrates first results on intra-operative and surgical simulator datasets for heart surgeries. Hervella et al. [15] demonstrated a similar method for the case of retinal fundus images. However, the model has to learn from a heavily unbalanced dataset due to the nature of point landmarks in the context of a segmentation task. Brosch et al. [1] tackles this problem by using a novel objective function. In this paper, we extend our previous work [24] and tackle the unbalanced multi-instance sparse-segmentation task through the use of a differentiable convolutional Soft-Argmax layer combined with a balanced loss function. Iqbal et al. [16] used a differentiable Soft-Argmax layer to extract the landmark locations from the heatmap, but the problem formulation contained a single heatmap per channel for a pre-defined number of heatmaps. Chandran et al. [5] used a heatmap combined with the differentiable Soft-Argmax layer to extract regions of interest to provide a global context to landmark localization. In contrast to Iqbal et al. [16] and Chandran et al. [5], our approach uses a convolutional Soft-Argmax layer that is convolved spatially with the feature map, in order to impose stability in modeling a distribution and extracting multiple instances of landmarks from this distribution. Additionally, an F β loss is used to take into account the precision and recall for optimisation. Results compared to our previous baseline [24], along with ablations with various loss functions and output layer configurations, are presented.

Task Formulation
The labels s i ∈ S, i ∈ 1, 2....N can be equivalently represented as a binary mask, where p (x i ,y i ) ∈ {0, 1}, x i ∈ 1, 2...., H, y i ∈ 1, 2...., W , denotes the pixel value at image location (x i , y i ), with a value of 1 in the suture locations and 0 otherwise. Alternatively, the position of each suture instance can be modelled by a distribution centered around the location that represents the likelihood of the pixel being a landmark location. In this case, p (x i ,y i ) takes values in [0, 1]. In this work, we present and compare two different distribution functions to model the heatmap, namely the Gaussian and the Tanh distribution. In the case of the Gaussian distribution, the spread is controlled by the variable σ 1 , which we set to σ 1 = 1, 2, and 3. The Tanh distribution is a sharper distribution, where we set the variable α = 3.5 × σ that controls the spread of the distribution. We experiment with the values of α = 7 and α = 10.5. An illustration comparing the Gaussian and Tanh distribution is provided in Fig. 1.

Network Architecture
The labelsŷ j for unseen endoscopic images can be estimated by a neural network φ(x j , y j , θ) with parameters θ. A U-Net [20] based architecture is used, similar to the one described in our previous work [24], but using ReLU activations for the convolutional layers. RGB 3-channel images of size 288 × 512 are provided to the model as input. A mask of the same size is created from the labelled suture locations. A distribution function centered around each suture point is applied to the binary mask as described in Section 3.1.
Furthermore, a differentiable Gaussian filter with spread σ 2 is applied to the output of the Sigmoid layer. Different values of σ 1 and σ 2 are applied and a comparison is presented in Section 5.1. A similarity loss L 1 between the filtered output (Output Stage 1 in Fig. 2 (a) and (b)) and the ground-truth heatmap is applied, that enforces the model to learn the likelihood distribution of the suture locations. The Gaussian filter encourages the model to learn a smooth distribution around the predicted locations. Additionally, the filtered output is fed through a differentiable convolutional 2D spatial Soft-Argmax layer to produce the final output of the model (Output Stage 2 in Fig. 2 (a) and (b)). In this layer, a Soft-Argmax kernel of size (3×3), with a stride of 1 and a padding of 1 is convolved with the output from the previous layer. The layer is implemented using the Kornia [19] library. An additional similarity loss L 2 is applied at this stage, as illustrated in Fig. 2 (a) and (b). Other works [16,5] demonstrated the use of a differentiable Soft-Argmax layer in extracting the landmarks from the heatmaps, where a single landmark is used per channel of the heatmap and as a result, the Soft-Argmax layer yields the landmark locations. In contrast, we represent all suture locations in a single channel. Therefore, the convolutional Soft-Argmax layer functions as a local non-maximum suppression for the points with low likelihood of being a suture point.
At the output of stage 1, a loss function of the form L 1 = M SE + 1 − SDC is applied between the predicted suture points and the ground-truth heatmap, where M SE is the Mean Squared Error and SDC is the Sørensen Dice Coefficient. For the model variant 1 at the Output Stage 2 as shown in Fig. 2 (a), the same loss L 2 = M SE + 1 − SDC is applied for a heatmap ground-truth. For the model variant 2 at the Output Stage 2 as shown in Fig. 2 (b), the output is optimised with a binary ground-truth mask. In this case, the true and false pixel classes are even more unbalanced. An asymmetric similarity loss function that can weight the precision and recall, is previously shown to perform better for unbalanced classes [14], in comparison to the dice coefficient. We therefore apply a balanced F β loss function for the predicted and ground-truth binary masks where p i is the likelihood of a pixel being a suture, and the binary label g i ∈ {0, 1} denotes the presence of a suture point. A value of β = 2 is used, to penalise  the number of false negatives. The final suture detection model, jointly optimises the loss functions L 1 and L 2 , given by The predicted heatmaps that are obtained after evaluation are thresholded with t = 0.5, and the center-of-mass of the local clusters are computed, to extract the predicted suture coordinates, that are then evaluated with the labeled suture coordinates.

Evaluation
A suture point is considered to be successfully predicted, if the distance between the predicted and ground-truth point is less than 6 pixels, as proposed in [24]. If multiple points are matched with the ground-truth, then the point closest to the ground-truth is chosen as the predicted point. Once the closest match is allocated the second-best match is used to match other labels in the image. An illustration of this is shown in Figure 3 (a). From these predicted co-ordinates, the number of True Positives (TP), False Positives (FP), and False Negatives (FN) are determined. Furthermore, the Positive Predicted Value (PPV) is computed as P P V = T P/(T P + F P ) and the True Positive Rate (TPR) is determined as T P R = T P/(T P + F N ). In order to compare the performance of two different models, the F 1 score is computed by taking the harmonic mean of P P V and T P R as F 1 = (P P V ×T P R)/(P P V +T P R). Additionally in this work, we compute the root mean square error of the Euclidean distance as a localisation metric. For each predicted point in an image, the Euclidean distances to all ground-truth points are computed and the least distance is chosen. This is then averaged across all predicted points in an image, for the images that have predictions. For images without predicted points, the metric cannot be computed, and we therefore additionally report the number of images where this occurs in Table 3. For RM SE computation, we consider each predicted point in the calculation, i.e. each point has a match. Points which are further away without a match would otherwise not be penalized in the metric.

Datasets
In this work, datasets from two domains are used for the experiments, namely: the intra-operative, and the surgical simulator domain. The data in the intra-operative domain comes from endoscopic frames captured during mitral valve repair surgery. The intra-operative data forms a heterogeneous dataset comprising of frames with widely varying camera view, scale, lighting sources, and white balance. Additionally, the dataset also contains the presence of endoscopic artefacts caused due to occlusion and specular reflections. The intra-operative datasets are split on a surgery level, like in the mentioned Challenge. Firstly, a dataset for cross-validation (A.1) is created, comprising of 4 surgeries, with a total of 2, 376 images. It is important to note, that the validation set is not used for fine-tuning the model performance, for example, using early stopping. Therefore, the validation set functions as an unseen dataset for the model. Additionally, a second independent intra-operative test set is created (A.2). The data in the surgical simulator domain comes from endoscopic capture of the surgical training and planning sessions on the mitral valve silicone replica models [10]. A simulator dataset (B.1) is created from 10 such simulator sessions, with a total of 2, 708 images. The endoscopic videos are captured in a top-down stereo format, after which relevant frames are extracted. The left and right images of the stereo pair are treated independently. The suture points identified in these frames are then manually labeled using the annotation tool labelme. For further details about the endoscopic capture of the data, the reader is referred to our previous work [24,22]. For this work, the previously annotated suture point labels are revisited and quality-checked. After correction, the intra-operative cross-validation dataset contains a total of 23, 938 sutures, and the simulator dataset a total of 33, 872 sutures.The data is released within the scope of the AdaptOR2021 MICCAI challenge at https://adaptor2021.github.io/ The endoscopic frames are resized to a resolution of 288 × 512, and image and mask augmentations are applied, before feeding to the model. Similar to our previous work [24], the dataset was augmented with vertical and horizontal flip, rotation of ±60 • , and affine translations with ±10%. Additionally, image augmentations comprising of pixel shifting in range ±1%, shearing in range ±0.1, brightness in range ±0.2, contrast in range from 0.3 to 0.5, random saturation in range from 0.5 to 2.0, and hue in range of ±0.1 are applied. All augmentations are applied with a probability of 50%.

Experiments
Firstly, in the intra-operative domain, a 4-fold cross-validation is performed on dataset A.1. Additionally, the models are evaluated on an independent test set A.2. In the simulator domain, a 5-fold cross-validation is performed on the dataset B.1. Our baseline results were presented in [24] for a Gaussian heatmap with a σ 1 = 1. Here, we recompute the baseline for σ 1 ∈ 1, 2, 3, with the refined labels and the data split as described in Section 4.1. We present a comparison of the model variants described in Section 3.2 with the baseline. Furthermore, we perform a sensitivity analysis with different parameters of the heatmap distribution, namely Gaussian with σ 1 ∈ 1, 2, 3, 4, σ 2 ∈ 1, 2, 3 and Tanh with α ∈ 7, 10.5. Here, the effect on the model performance in relation to varying σ 1 and σ 2 are presented. Moreover, we present an evaluation with a localisation metric, as described in Section 3.3. Finally, we present a comparison of the evaluation with 3 different radii around the ground-truth point, namely 6, 8, and 10 pixels, for the bestperforming model in each domain. The network is trained with a learning rate of 0.001, with an Adam optimizer. A learning rate decay scheme is used to reduce the rate by a factor of 0.1 upon a plateau for 10 epochs. The models were trained on one of NVIDIA Quattro P6000, NVIDIA TITAN RTX, or NVIDIA TITAN V. The PyTorch library was used to implement the model pipeline.

Results
The results of the 4-fold cross-validation in the intra-operative domain on dataset A.1 are presented in Table 1 (a). Additionally, an evaluation of two variants of the proposed model in comparison with the baseline from our previous work [24], on the intra-operative test set A.2 is shown in Table 1 (b). In the simulator domain, results of the 5-fold cross-validation are presented in Table 1 (c). Samples from prediction from cross-validation in the intra-operative (A.1) and the simulator domain (B.1) are shown in Fig. 5.
Firstly, from the cross-validation in the intra-operative dataset (A.1), it can be seen that in comparison with the our previous baseline with the value of σ 1 = 1 [24], the performance of the model increases with σ 1 = 2. For both the values of σ 1 = 2, σ 1 = 3, the model performs better than while using a Tanh distribution, with respective values of α = 7, α = 10.5 (mean F 1 +0.0082 for σ 1 = 3, α = 10.5 OR A.1 c.f. Table 1 (a)). To recall, σ 1 here denotes the spread of the Gaussian distribution used to create the masks. σ 2 refers to the parameters of the local differentiable Gaussian layer used in the proposed model variants.
In the intra-operative dataset, the Variant 1 of the proposed model with σ 1 = 3 outperforms the baseline from our previous work [24] with σ 1 = 1 (mean F 1 +0.0082 for σ 1 = 3 OR A.1 c.f. Table 1 (a)). Variant 1 also outperforms the baseline model with the same σ 1 value of 3 (mean F 1 +0.0080 for σ 1 = 3 OR A.1 c.f. Table 1 (a)). In this case, the difference is the differentiable local Gaussian and SoftArgMax layers in the model architecture. A larger spread of the Gaussian distribution provides more likelihood values around every landmark and additionally reduces the imbalance of the pixels in the dataset, thereby helping the model learn better. However, a larger spread around the suture point also means that the model is prone to confounding from nearby points due to overlapping distributions. Similarly, in the Simulator domain, the proposed model Variant 1 with σ 1 = 2 outperforms the baseline from our previous work [24] (mean F 1 +0.0865 Sim B.1 c.f. Table 1 (c)), with σ = 1, and the baseline model with the same value of σ 1 = 2 (mean F 1 +0.0354 Sim B.1 c.f. Table 1 (c)).
(a) Cross-validation on OR data (A.1) (Higher is better) Experiment Mask distribution P P V T P R F 1 Baseline [24] Gauss, σ = 1 62.2700 ± 9.54 33.1350 ± 6.68 0.4376 ± 0.06 Baseline [24] Gauss, σ = 2 64.4450 ± 11.01 37.6525 ± 8.46 0.4720 ± 0.05 Baseline [24] Gauss, σ = 3 65.8775 ± 8.95 37.8200 ± 10.58 0.4718 ± 0.08 Baseline [24] Tanh, α = 7 69.2850 ± 6.42 34.6900 ± 5.74 0.4494 ± 0.05 Baseline [24] Tanh, α = 10. In the intra-operative domain, Variant 2 of the proposed model does not outperform the baseline with the corresponding σ 1 value (mean F 1 −0.0144 for σ 1 = 3 OR A.1 c.f. Table 1 (a)). In the simulator domain however, the Variant 2 outperforms the corresponding baseline (mean F 1 +0.0324 Sim B.1 c.f. Table 1 (c)). Binary masks in this case, without a likelihood distribution constitute a highly imbalanced dataset, which hampers the learning process and affects performance. In both domains, the model Variant 1 yields the best performing model. Furthermore, for values σ 1 = 2, σ 1 = 3, the values of σ 2 are varied between 1, 2, and 3 and the results are presented in Table 2. In each domain, the best performing model is with the value σ 2 = 1. In both the cases of intra-operative and the simulator domains, there is a best-performing value of (σ 1 , σ 2 ) after which the performance of the model drops. In the case of the intra-operative domain, this performance occurs at (σ 1 = 3, σ 2 = 1) and in the case of the simulator domain, at (σ 1 = 2, σ 2 = 1). This is due to the trade-off that occurs while increasing the spread of the distribution around the suture points. In order to understand this trade-off, the model performance is analysed at the level of two different subsets. Firstly, a subset of close-points are defined as the points that are within a distance of 15 pixels within each other. The rest of the points are categorised as non-close points. Then, the change in the True Positive points, as we go from σ 1 = 2 to σ 1 = 3 is analysed. An example illustration in the simulator domain is shown in Figure 4. It can be seen that the drop in the percentage of True Positives is higher in the case of the close subset in comparison to the points that are not located close to each other.  Moreover, we compute the root mean square error of the Euclidean distance as explained in Section 3.3, the results of which are presented in Table 3. As can be seen from Table 3, the results are different as compared to the F 1 score metric presented in Table 2. Although the RM SE distance provides an indication of the closeness of the points to the ground truth labels, it is difficult to analyse a case where the RM SE of two models are the same despite one of the models predicting more False Positives, since the metric is averaged over each predicted point. An example of this is shown in Figure 3 (b). Finally, we present an evaluation with 3 different radii around the ground-truth point for which a match is allocated, namely 6 pixels, 8 pixels, and 10 pixels, for the best-performing model in each domain, as can be seen in Table 4.

Discussion
In this paper, as an extension to our previous work [24], we tackle the suture detection task by introducing a differentiable 2D Gaussian filter layer, and an additional differentiable convolutional 2D spatial convolutional Soft-Argmax layer.   Unlike other works [16,5], that use a Soft-Argmax layer to directly extract the landmarks from the heatmap from a single channel, we present its use as a form of local non-maximum suppression to filter out points with low likelihood of being a suture. Firstly, we perform experiments comparing the baseline from our previous work [24], with different values of σ 1 . Here, we also present comparison of the Gaussian distribution with a Tanh distribution with a similar spread. Then we present two variants of our proposed model in comparison with the baseline (c.f. Table 1). Further, we present experiments by varying values of σ 1 ∈ 1, 2, 3, 4 and σ 2 ∈ 1, 2, 3 (c.f. Table 2). In addition to the evaluation with the F 1 score, we compute an RM SE metric (c.f. . The intra-operative dataset is a highly heterogeneous dataset comprising of images from different viewing angles, scale, light sources, and white balance. Furthermore, the intra-operative datasets contain endoscopic artefacts caused due to specularities, and occlusions from tissue or surgical instruments in the scene which make it a challenging dataset to learn from. Finally, it is often the case that two sutures are stitched close to each other. This makes it further difficult for the model, and a human reader, to distinguish nearby sutures. In particular, the final 2D Gaussian filter layer and the convolutional 2D spatial Soft-Argmax layer operate locally with a window and are prone to be confounded by closely occurring suture points. This is especially true in the case of higher Gaussian σ 1 values, as can be seen in Table 2. Varying the values σ 1 and σ 2 each have an effect on model performance in relation to the number of points in the dataset that are nearby or farther away from each other. In this regard, an adaptive variation in the Gaussian distribution is a potential future work, to handle these variations.
Besides providing quantitative information for analysis of endoscopic data, the learned representations from the suture detection task can also be used to support other learning objectives. In particular, this task is relevant in the context of generative models to transform data from the simulator to the intra-operative domain [7,12]. In our recent work [22], we show that suture detection models can be used to mutually improve generative domain transformation in endoscopy.

Conclusion
In this work, we tackle the task of suture detection for endoscopic images by formulating it as a multi-instance sparse heatmap-regression problem. We extend our previous work [24] and improve upon the previously reported baselines. We introduce a novel Gaussian filter layer and a differentiable convolutional Soft-Argmax layers. We compare multiple distribution functions and present two variants of the model that outperform the baselines. Suture detection is an important task that can further be used towards supporting the learning of related objectives for endoscopic image analysis.
Conflict of interest: All authors declare that they have no conflict of interest. The data is released within the scope of the AdaptOR2021 MICCAI challenge at https://adaptor2021.github.io/. Ethics Approval: The study was approved by the Local Ethics Commitee from University Hospital Heidelberg. Registration numbers are S-658/2016 (12.09.2017) and S-777/2019 (20.11.2019). The study was performed in accordance with the ethical standards as laid down in the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards. Consent to participate Informed consent was obtained from all individual participants included in the study. Consent for publication Not applicable Code availability The authors plan to make the code publicly available in the subsequent months.