Learning-based keypoint registration for fetoscopic mosaicking

Purpose In twin-to-twin transfusion syndrome (TTTS), abnormal vascular anastomoses in the monochorionic placenta can produce uneven blood flow between the two fetuses. In current practice, TTTS is treated surgically by closing abnormal anastomoses using laser ablation. This surgery is minimally invasive and relies on fetoscopy. The limited field of view makes anastomosis identification a challenging task for the surgeon. Methods To tackle this challenge, we propose a learning-based framework for in vivo fetoscopy frame registration for field-of-view expansion. The novelty of this framework lies in a learning-based keypoint proposal network and an encoding strategy to filter (i) irrelevant keypoints, based on fetoscopic semantic image segmentation, and (ii) inconsistent homographies. Results We validate our framework on a dataset of six intraoperative sequences from six TTTS surgeries in six different women against the most recent state-of-the-art algorithm, which relies on the segmentation of placental vessels. Conclusion The proposed framework achieves higher performance compared to the state of the art, paving the way for robust mosaicking to provide surgeons with context awareness during TTTS surgery. Supplementary Information The online version contains supplementary material available at 10.1007/s11548-023-03025-7.


Introduction
Twin-to-Twin Transfusion Syndrome (TTTS) is a rare complication affecting 10-15% of monochorionic diamniotic pregnancies. TTTS is characterized by the development of unbalanced and chronic blood transfer from one twin (the donor) to the other (the recipient) through placental communicating vessels called anastomoses (Baschat et al., 2011). This shared circulation causes profound fetal hemodynamic imbalance and, consequently, severe growth restriction, cardiovascular dysfunction, hypoxic brain damage and death of one or both twins (Lewi et al., 2013).
The recognized elective treatment for TTTS is selective laser photocoagulation of anastomoses originating in the donor's placental territory. This treatment requires precise identification and laser ablation of placental vascular anastomoses (Beck et al., 2012). Despite recent advancements in instrumentation and imaging for TTTS (Cincotta and Kumar, 2016), residual anastomoses after ablation still represent a major complication (Lopriore et al., 2007). This may be explained considering the challenges, on the surgeon's side, of the limited Field of View (FoV) and constrained manoeuvrability of the fetoscope, especially in the case of anterior placenta. In this complex scenario, Computer-Assisted Intervention (CAI) and Surgical Data Science (SDS) methodologies (Maier-Hein et al., 2022) may be exploited to provide surgeons with mosaicking for FoV expansion.
An approach to mosaicking relying on external devices is proposed in Tella-Amo et al. (2016). However, external devices may not always be used in the operating room due to current regulations. Currently, researchers are focusing on methods to perform mosaicking using only fetoscopy images. Several challenges in endoscopic images hamper the translation of previously developed methods into actual surgical practice, as reported by Kennedy-Metz et al. (2021). These challenges include poor visibility due to amniotic fluid turbidity, the low resolution of fetoscopic images, occlusions by surgical tools and fetus (Fig. 1 (a)), lack of anatomical structures (Fig. 1 (b, d, f)) to be used as reference for frame registration, poor frame texture (Fig. 1 (c)) and distortion introduced by non-planar views due to fetoscope camera orientation, especially in the case of anterior placenta (Fig. 1 (c, e)) (Bano et al., 2021). First approaches to mosaicking using only fetoscopy images include the work of Daga et al. (2016) and Reeff et al. (2006), who relied on standard descriptors such as SIFT (Gutiérrez and Robles (2016)) and SURF (Bay et al. (2006)). These methods were validated on synthetic phantoms or ex-vivo placental sequences only, and may fail when processing in-vivo placenta frames, due to the lack of texture typical of intraoperative endoscopy images. The work in Peter et al. (2018) follows a different approach, optimizing photometric consistency between frames, and shows promising results with in-vivo fetoscopy data. However, the computation time to process a frame pair is a major bottleneck and may not be compatible with real-time mosaicking. More recently, deep-learning algorithms have been proposed to tackle the challenges of fetoscopy frames. In Gaisser et al. (2018), stable regions manually identified in the frames are used as a prior for frame registration with a convolutional neural network. The approach is tested on phantoms only. The work in Bano et al. (2019) uses HomographyNet (DeTone et al., 2017) to perform pair-wise homography estimation, but the validation is performed on a single in-vivo sequence.
A recent and promising approach in the field, presented in Bano et al. (2020a,b), shows that placental vessels provide unique landmarks to compute homography. While obtaining accurate vessel segmentation may be considered a tractable challenge (Bano et al., 2021), the approach in Bano et al. (2020a,b) fails whenever vessels are not visible.
The challenges of in-vivo fetoscopy video analysis may explain why only a few researchers have attempted to design algorithms for fetoscopy mosaicking so far. Last year, we organized the sub-challenge "FetReg: Placental Vessel Segmentation and Registration in Fetoscopy"1, inside EndoVis, a MICCAI Grand Challenge. Only one team competed in the task "Placental vessel registration and RGB frame registration for mosaicking" (Bano et al., 2022). With this work, we aim to contribute to the advancement of the state of the art in fetoscopy mosaicking by investigating, through a comprehensive study with 6 videos (14500 frames) acquired from 6 women during actual surgery, the following research hypotheses:
• Hypothesis 1 (H1): Keypoint learning can tackle the challenges typical of fetoscopic videos acquired during TTTS surgery and provide robust keypoints for mosaicking without relying on the segmentation of structures in the FoV.
• Hypothesis 2 (H2): Mosaicking performance can be boosted by filtering irrelevant keypoints and rejecting inconsistent homographies.

Contribution
In this paper, we propose a learning-based framework for the robust detection of keypoints with the aim of registering consecutive fetoscopy images acquired during TTTS surgery and accomplishing fetoscopy mosaicking. Our framework does not rely on any structure segmentation for keypoint estimation. However, when the fetus and surgical tools are in the FoV, their segmentation is used for irrelevant keypoint rejection.
The contributions of this work can be summarized as follows:
1. Development of a new framework for robust FoV expansion in TTTS fetoscopy videos, which features an intrinsic strategy for detecting keypoints robustly and filtering inconsistent homographies.
2. Design of a self-supervised strategy for training the framework with unlabeled fetoscopy frames.
3. Validation using the largest dataset in the field, which consists of 6 in-vivo TTTS videos from Bano et al. (2019).
To the best of our knowledge, this work is the first to investigate the potential of keypoint learning for fetoscopy mosaicking. We perform an extensive comparison with the state of the art, as well as an ablation study to identify the best configuration of our framework.

Method
Our framework consists of an improved Keypoint Proposal Network (KPN) (Sec. 2.1), for keypoint learning and irrelevant keypoint rejection, and of registration for mosaicking (Sec. 2.2), which estimates homography from the keypoints and filters inconsistent homographies. The overall framework is shown in Fig. 2.

Keypoint proposal computation.
The KPN is a convolutional neural network based on SuperPoint, proposed by DeTone et al. (2018). SuperPoint is a learning-based keypoint detection network that exhibits state-of-the-art performance on a large number of geometry problems in computer vision, including homography estimation, where ground truth is not available.
KPN consists of a VGG-16 backbone for feature extraction, followed by two heads: the Keypoint Head (KH), for the detection of candidate keypoints, and the Descriptor Head (DH), for computing keypoint descriptors. KH outputs a dense point map, with the same size as the input frame, where the value of each pixel refers to the probability of that pixel being a keypoint. DH outputs an L2-normalized descriptor vector for each candidate keypoint.
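The paper does not detail how the dense point map is converted into sparse keypoints, but a common recipe (also used by SuperPoint) is probability thresholding followed by non-maximum suppression. The sketch below is a minimal numpy-only illustration; the threshold and suppression radius are illustrative values, not the paper's.

```python
import numpy as np

def extract_keypoints(point_map, prob_thresh=0.015, nms_radius=4):
    """Turn a dense per-pixel keypoint probability map into sparse keypoints.

    Keeps pixels above `prob_thresh`, then applies a greedy non-maximum
    suppression so no two keypoints lie closer than `nms_radius` pixels.
    """
    ys, xs = np.where(point_map > prob_thresh)
    scores = point_map[ys, xs]
    order = np.argsort(-scores)          # strongest responses first
    kept = []
    for idx in order:
        y, x = ys[idx], xs[idx]
        if all((y - ky) ** 2 + (x - kx) ** 2 > nms_radius ** 2
               for ky, kx in kept):
            kept.append((y, x))
    return np.array(kept, dtype=int)     # (N, 2) array of (row, col)

# toy example: two nearby strong responses collapse into one keypoint
pm = np.zeros((32, 32))
pm[10, 10], pm[11, 11], pm[25, 5] = 0.9, 0.8, 0.5
kps = extract_keypoints(pm)
```

In the toy map above, the response at (11, 11) is suppressed because it falls within the suppression radius of the stronger response at (10, 10).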

KPN training.
We train the KPN in four steps, taking inspiration from DeTone et al. (2018). As a first step, to account for the lack of annotated TTTS frames, we train KPN without DH on a synthetic dataset. We use synthetic image-keypoint pairs obtained from the synthetic shapes dataset presented in DeTone et al. (2017). Each pair consists of an image of size 448×448 pixels containing simple polygons and the associated keypoints. To increase the variability of the dataset, we apply during training (i) perspective distortions (i.e., homographic augmentation), to model different camera views, and (ii) brightness and contrast augmentation (i.e., photometric augmentation), to encode intensity variability.
As a second step, we fine-tune KPN, still without DH, on natural images from the MS-COCO 2014 training dataset (Lin et al. (2014)). In this case, we follow a self-supervised training strategy to account for the lack of keypoint annotations in the dataset. We run inference on the MS-COCO 2014 test dataset using the weights obtained in the first step. From each image in the dataset, a patch of 448×448 pixels is randomly cropped and converted to grayscale. The estimated dense point map is used to generate the pseudo-ground truth. Also in this case, photometric and homography-based augmentation is applied on the fly.
The last two steps involve the TTTS dataset. In the third step, we run inference on a subset of our TTTS dataset using the weights learned at the second step to obtain the associated pseudo-ground truth. We use this pseudo-ground truth to fine-tune the KPN without DH. In the last step, we update the TTTS-image pseudo-ground truth using the weights obtained at the third step. This pseudo-ground truth is used to train the whole KPN on TTTS images. For homographic augmentation, we limit the parameter range to be consistent with the fetoscope camera model.
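The homographic and photometric augmentations described above can be sketched as follows. This is a simplified numpy illustration; the parameter ranges are placeholders, whereas the paper restricts them to match the fetoscope camera model.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_homography(max_persp=1e-4, max_rot=0.1,
                      max_scale=0.1, max_trans=10.0):
    """Sample a random homography combining rotation, isotropic scale,
    translation and small perspective terms (ranges are illustrative)."""
    a = rng.uniform(-max_rot, max_rot)                # rotation angle (rad)
    s = 1.0 + rng.uniform(-max_scale, max_scale)      # isotropic scale
    tx, ty = rng.uniform(-max_trans, max_trans, 2)    # translation (px)
    p1, p2 = rng.uniform(-max_persp, max_persp, 2)    # perspective terms
    return np.array([[s * np.cos(a), -s * np.sin(a), tx],
                     [s * np.sin(a),  s * np.cos(a), ty],
                     [p1,             p2,            1.0]])

def photometric_augment(img, max_brightness=0.2, max_contrast=0.2):
    """Random brightness/contrast jitter on a float image in [0, 1]."""
    b = rng.uniform(-max_brightness, max_brightness)
    c = 1.0 + rng.uniform(-max_contrast, max_contrast)
    return np.clip(c * img + b, 0.0, 1.0)

H = random_homography()
img = photometric_augment(rng.random((448, 448)))
```

During self-supervised training, the sampled homography H would also be used to warp the pseudo-ground-truth keypoint map, so that detections remain consistent across the two views.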
For all steps, we use the loss function L_KPN defined as:

L_KPN = L_kp + L'_kp + λ L_d(D, D')

where L_kp is the cross-entropy loss computed between the keypoint map generated by KH and its ground truth, L'_kp is the same loss computed on the warped keypoint map generated by KH after image warping with a random homography, and L_d(D, D') is the hinge loss between the descriptors D from the original image and the descriptors D' from the warped image, weighted by the term λ. λ is adjusted during training to balance the effect of the L_d(D, D') term, which, especially in the first training epochs, has largely negative values. We noticed experimentally that the KPN also finds keypoints on structures, such as the fetus and surgical tools, that are not relevant to model fetoscope movement. These keypoints could affect homography estimation negatively. To reject irrelevant keypoints, we filter out keypoint proposals that fall inside the fetus and surgical-tool segmentation masks. These masks are obtained using the U-Net with ResNet50 backbone presented in Bano et al. (2021).
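The irrelevant-keypoint rejection step amounts to a mask lookup: any keypoint falling inside the fetus or tool segmentation is discarded together with its descriptor. A minimal numpy sketch (array shapes are assumptions for illustration):

```python
import numpy as np

def reject_irrelevant_keypoints(keypoints, descriptors, fetus_mask, tool_mask):
    """Drop keypoints (and their descriptors) that fall inside the fetus or
    surgical-tool segmentation masks.

    keypoints   : (N, 2) integer array of (row, col) coordinates
    descriptors : (N, D) array, one descriptor per keypoint
    *_mask      : boolean arrays of the frame size, True on the structure
    """
    irrelevant = fetus_mask | tool_mask
    keep = ~irrelevant[keypoints[:, 0], keypoints[:, 1]]
    return keypoints[keep], descriptors[keep]

# toy example: one keypoint lands on the tool mask and is discarded
kps = np.array([[5, 5], [20, 20]])
desc = np.eye(2)
fetus = np.zeros((32, 32), dtype=bool)
tool = np.zeros((32, 32), dtype=bool)
tool[20, 20] = True
kps_f, desc_f = reject_irrelevant_keypoints(kps, desc, fetus, tool)
```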

Homography estimation.
Assuming KPN to be robust, we design a simple frame-pair registration pipeline to achieve fast registration at low computational cost. We approximate registrations as affine transformations, assuming that these provide a reasonably good description of fetoscope camera movement, following the considerations in Bano et al. (2020a). The homography between two consecutive frames is estimated using Levenberg-Marquardt optimisation. RANSAC is used to find the keypoints that match the affine-transformation constraint. Besides RANSAC, we also evaluated other robust estimation algorithms, such as PROSAC and MAGSAC. However, a comparison of these methods in a preliminary analysis did not show any significant difference in mosaicking performance.
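To make the RANSAC step concrete, here is a deliberately simplified numpy sketch that fits the affine model with plain least squares inside the RANSAC loop, rather than the Levenberg-Marquardt refinement used in the paper. Iteration count and inlier threshold are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_affine(src, dst):
    """Least-squares affine transform mapping src -> dst (both (N, 2))."""
    A = np.hstack([src, np.ones((len(src), 1))])     # (N, 3) design matrix
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)      # solve A @ M = dst
    return M.T                                       # (2, 3) affine matrix

def ransac_affine(src, dst, n_iters=200, thresh=2.0):
    """Robustly estimate an affine transform from noisy keypoint matches."""
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(src), 3, replace=False)  # minimal affine sample
        M = fit_affine(src[idx], dst[idx])
        pred = src @ M[:, :2].T + M[:, 2]
        inliers = np.linalg.norm(pred - dst, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # refit on the full consensus set
    return fit_affine(src[best_inliers], dst[best_inliers]), best_inliers

# toy example: pure translation (+5, +3) with two gross outlier matches
src = rng.uniform(0, 100, (20, 2))
dst = src + np.array([5.0, 3.0])
dst[0] += 50.0
dst[1] -= 40.0
M, inliers = ransac_affine(src, dst)
```

On this toy input, RANSAC recovers the translation and flags the two corrupted matches as outliers.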

Inconsistent homography filtering.
Since we register consecutive frames, we can assume that the homography should not reflect large displacement, rotation or scaling. We take inspiration from Bano et al. (2019) to filter out any homography that does not satisfy this assumption. We perform singular value decomposition on each estimated homography to extract the rotation, scale and translation parameters. When one of these parameters exceeds a threshold defined experimentally, the second frame in the pair to be registered is discarded, and registration with the next frame is attempted. This procedure is reiterated for up to five frames and, in case of failure, mosaicking computation ends.
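One way to realise this consistency check is sketched below: the SVD of the upper-left 2×2 block yields the closest rotation and the scale (average singular value), while the last column gives the translation. The thresholds here are illustrative; the paper sets them experimentally.

```python
import numpy as np

def homography_is_consistent(H, max_trans=50.0, max_rot_deg=10.0,
                             max_scale_dev=0.2):
    """Check that a homography between consecutive frames stays within
    plausible displacement, rotation and scaling bounds."""
    H = H / H[2, 2]                       # normalise the homography
    A = H[:2, :2]                         # affine (rotation/scale) part
    t = H[:2, 2]                          # translation part
    U, S, Vt = np.linalg.svd(A)
    R = U @ Vt                            # closest rotation to A
    angle = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    scale = S.mean()                      # average singular value
    return (np.linalg.norm(t) <= max_trans
            and abs(angle) <= max_rot_deg
            and abs(scale - 1.0) <= max_scale_dev)

# a small inter-frame motion passes; a large jump is rejected
H_small = np.array([[1.01, 0.0, 3.0], [0.0, 1.01, -2.0], [0.0, 0.0, 1.0]])
H_jump = np.array([[1.6, 0.0, 120.0], [0.0, 1.6, 0.0], [0.0, 0.0, 1.0]])
```

A rejected homography would trigger the retry logic described above: drop the second frame and attempt registration with the next one, for up to five frames.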

Dataset
We validate our framework using an extended version of the dataset published in Bano et al. (2020a). The overall dataset consists of 1450 frames from 6 different in-vivo TTTS fetoscopy procedures. The main characteristics of the dataset are summarized in Table 3. Frame number and resolution vary from video to video.
Videos differ in terms of intra-operative environment, artifacts and lighting conditions. Two videos present an anterior placenta. While in posterior-placenta videos the scene can be considered nearly planar because a straight fetoscope is used, with an anterior placenta the surgeon relies on a 30° fetoscope (Ahmad et al. (2020); Casella et al. (2021)). The resulting non-planar view adds further challenges to mosaicking, as introduced in Sec. 1.

Implementation details
Our framework is implemented in TensorFlow 1.15 and trained on two NVIDIA A100 40GB GPUs, using the ADAM optimizer and a learning rate of 10^-3. For training the KPN following the strategy described in Sec. 2.1.2, in the first 3 training steps we set a batch size of 64, while a batch size of 8 is used for the last step. For the 4 steps, we set the number of iterations to 180000, 60000, 20000 and 12000, respectively.

Performance metrics
We measure the performance of our framework using the structural similarity index measure (SSIM) computed over a number (n) of frames, with n ∈ [1, 5], for fair comparison with Bano et al. (2020a). We call this metric SSIM_n. Given a source frame (I_i), a target frame (I_{i+n}) and a homography transformation (H_{i→i+n}) between I_i and I_{i+n}, for every i-th frame in the TTTS sequence SSIM_n is defined as:

SSIM_n = sim(W(Ĩ_i, H_{i→i+n}), Ĩ_{i+n})

where sim(·) is the standard formula for SSIM, W is the warping operator, and Ĩ_i and Ĩ_{i+n} are smoothed versions of I_i and I_{i+n}, respectively. Ĩ_i and Ĩ_{i+n} are obtained by applying 9 × 9 Gaussian filtering with a standard deviation of 1.5. This makes SSIM_n robust even in the presence of amniotic fluid particles and fetoscopy-image noise. When exploring the vascular network, the fetoscope mainly makes small movements. With low-texture images whose overlap percentage is high due to such small displacements, similarity metrics yield very high values that are not useful for discriminating between fetoscopic image registration methods. Using values of n larger than 1 allows us to assess the presence of drift across consecutive homographies.
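The smoothing and similarity computation can be sketched in numpy as follows. For brevity, this sketch uses a single-window ("global") SSIM over the whole image rather than the usual locally windowed SSIM, and omits the warping step W, so it illustrates the metric's ingredients rather than reproducing the paper's exact evaluation.

```python
import numpy as np

def gaussian_kernel(size=9, sigma=1.5):
    """Normalised 2-D Gaussian kernel (9x9, sigma 1.5, as in the paper)."""
    ax = np.arange(size) - size // 2
    g = np.exp(-ax ** 2 / (2 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def smooth(img, size=9, sigma=1.5):
    """Gaussian smoothing via direct 'valid' convolution (slow but simple)."""
    k = gaussian_kernel(size, sigma)
    h, w = img.shape
    out = np.zeros((h - size + 1, w - size + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + size, j:j + size] * k).sum()
    return out

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """Single-window SSIM between two images with values in [0, 1]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

# identical (perfectly registered) smoothed frames give a score of 1
a = np.linspace(0.0, 1.0, 32 * 32).reshape(32, 32)
sa = smooth(a)
score = ssim_global(sa, sa)
```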
For qualitative evaluation, the registered frames are blended together using the Mertens-Kautz-Van Reeth exposure fusion algorithm (Mertens et al., 2007) to tackle the non-uniform light exposure of the FoV along the fetoscopic video sequence.
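As a rough illustration of exposure-fusion blending, the sketch below weights each pixel by its "well-exposedness" (closeness to mid-gray), one of the quality measures of Mertens et al. (2007). It deliberately omits the contrast and saturation measures and the multi-resolution pyramid blending of the full algorithm, so it is a simplified sketch rather than the method used in the paper.

```python
import numpy as np

def exposure_fuse(frames, sigma=0.2, eps=1e-12):
    """Pixel-wise exposure fusion of aligned grayscale frames in [0, 1].

    Each pixel is weighted by a Gaussian of its distance from mid-gray
    (0.5); weights are normalised across frames and used for a weighted
    average. A pyramid-free simplification of Mertens et al. (2007).
    """
    stack = np.stack(frames)                                  # (K, H, W)
    weights = np.exp(-((stack - 0.5) ** 2) / (2 * sigma ** 2))
    weights /= weights.sum(axis=0) + eps                      # normalise
    return (weights * stack).sum(axis=0)

# an under- and an over-exposed view of the same region fuse to mid-gray
dark = np.full((16, 16), 0.15)
bright = np.full((16, 16), 0.85)
fused = exposure_fuse([dark, bright])
```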

Comparison with the literature and ablation study
We compare our framework with SIFT, a standard feature extractor used for mosaicking by Daga et al. (2016); Reeff et al. (2006). We further compare our framework with Bano et al. (2020a), which relies on deep learning for mosaicking and is the best-performing method in the state of the art. For all competitors, we replace any discarded homography with an identity matrix to preserve frame numerical consistency across the methods. The ablation study characteristics are summarized in Table 2.
As ablation study, we considered the following experiments:
• Experiment 0 (E0): SuperPoint pre-trained on the MS-COCO 2014 dataset, without any fine-tuning on fetoscopy data.
• Experiment 1 (E1): the vanilla KPN alone, excluding both irrelevant keypoint rejection and inconsistent homography filtering.
• Experiment 2 (E2): the KPN with irrelevant keypoint rejection, but without inconsistent homography filtering.
For E2, we further investigate the performance obtained on an extended version of the dataset presented in Sec. 3.1. This extended version consists of the same 6 videos, but each video has an extended length (average sequence length = 546 ± 237 frames). This allows us to evaluate the benefits of introducing homography filtering for longer video sequences.

Results
The average SSIM_n values with n equal to 5 obtained with SIFT, the work in Bano et al. (2020a) and the proposed framework are reported in Table 3. SIFT shows the lowest performance, as it fails in retrieving keypoints for mosaicking for several frames in all 6 videos. This is in agreement with similar findings in the SDS/CAI field reported by Liu et al. (2020).
For Video 1 and Video 2, where vessels are clearly visible and lens distortion is small, we obtained SSIM_n with n=5 equal to 0.750 ± 0.050 and 0.766 ± 0.048, respectively. These results are comparable to those of Bano et al. (2020a) (0.757 ± 0.081 and 0.788 ± 0.050, respectively). Hence, the work in Bano et al. (2020a) slightly outperforms the proposed framework for Video 1 and Video 2, by only 0.007 and 0.022, respectively. This was not true for the other videos, where the average SSIM_n was the highest for the proposed framework, which also granted the lowest standard deviation. The proposed framework outperforms Bano et al. (2020a) by at least 0.007 (Video 5), with the highest differences for Video 6 (0.045) and Video 4 (0.125).
Figure 4 reports the value of SSIM_n at different n obtained with SIFT, the work in Bano et al. (2020a) and the proposed framework for the 6 tested videos. The proposed framework consistently outperformed the competitors for every n for Videos 3 to 6. In the first two videos, the performance of the proposed framework was comparable to that of Bano et al. (2020a).
Figure 5 shows the trend of SSIM_n with n = 5 over time for the proposed method and Bano et al. (2020a). The quantitative analysis presented in Fig. 5 may also be appreciated from the qualitative examples shown in Fig. 6, where the proposed framework achieves good-quality mosaicking for all the tested videos, also when vessels are not visible.
The pre-trained SuperPoint (E0) achieves SSIM_n with n = 5 of 0.331, showing even lower performance than SIFT. E1 aims at assessing the performance of the vanilla KPN alone, hence excluding both irrelevant keypoint rejection and homography filtering. In this experiment, we achieve an average SSIM_n of 0.788, a loss of 0.058 with respect to the proposed framework.
With our ablation study E2, which aims at evaluating the benefit of introducing irrelevant keypoint rejection after the KPN, we achieve an average SSIM_n with n = 5 of 0.848. Despite the relatively small difference (0.064) in the performance achieved by our framework over E2, inconsistent homography filtering allows us to lower the drift in the mosaic and mitigate tracking loss in challenging videos, where images are strongly under-exposed or where noisy keypoints are computed (e.g., due to particles). Moreover, when processing the extended version of the dataset with longer sequences, our results improve by 3% when adding homography filtering, as shown in Table 4.

Discussion and conclusions
In this work, we proposed a mosaicking framework to perform FoV expansion in fetoscopy videos using learning-based keypoints. Going beyond the current state of the art, our framework does not rely on any prior vessel segmentation for keypoint prediction, which makes it robust when registering frames where vessels are not clearly visible or when vessel segmentation is not accurate. We instead use surgical-tool and fetus segmentation to filter out irrelevant keypoints, and propose a simple yet effective strategy to filter out unrealistic homographies.
To test our first research hypothesis (H1), we applied the proposed framework to challenging clinical videos from 6 different TTTS surgeries. We also compared the proposed framework with state-of-the-art approaches for fetoscopy mosaicking (Table 3), showing that our method performs well when others fail. From our experiments, as shown in Fig. 7, SIFT was not always able to compute a sufficient number of keypoints for homography estimation. Even when keypoints were computed in sufficient number, they did not allow us to model camera motion. This can be explained considering that SIFT is not robust for images with low contrast and texture, as is the case for fetoscopic frames. As for the comparison with Bano et al. (2020a), the absence of placental vessels in a number of consecutive frames (Videos 3, 4 and 5) compromised the registration process, while this was not the case for our framework. Moreover, from Video 3 to Video 6, the placenta surface is not perfectly planar in all frames, lens distortion is more evident, and the camera moves along different planes to scan the entire placenta surface. Nonetheless, the proposed framework did not fail in providing good-quality mosaicking. This can be explained considering our self-supervised training with homographic augmentation, which allowed us to detect keypoints robust enough to estimate homography despite changes in perspective. Our framework hence allows FoV expansion also in videos that suffer from changes of perspective, as in the case of anterior placenta (Video 3 to Video 5).
With our ablation study (E0), we further highlighted the benefit of the proposed framework over the pre-trained SuperPoint without any fine-tuning on fetoscopy data. The pre-trained SuperPoint has been optimized to compute robust keypoints from natural images, and thus faces challenges in dealing with the complexity of fetoscopic images. This explains why its out-of-the-shelf performance is even lower than SIFT's.
Our second research hypothesis (H2) focused on assessing the benefits of including irrelevant keypoint rejection and inconsistent homography filtering. When analyzing SSIM_n over the entire sequences (Fig. 5), our framework showed a lower number of drops in SSIM_n than Bano et al. (2020a). However, small drops were present in Video 3 and Video 4. This may be due to underexposed frames, where keypoint estimation is particularly challenging. As the amount of underexposed frames was reasonably small, the inconsistent homography filter was able to tackle the challenge.
The benefit of adding inconsistent homography filtering was specifically evident in the longer videos from the extended dataset. We explain this improvement considering that the extended dataset includes further challenges (i.e., field-of-view occlusions, faster fetoscope movements and extreme changes in illumination).
The proposed approach allows for the computation of robust keypoints from intraoperative images which, contrary to methods proposed in the literature, allows for fetoscopic mosaics with less drift. As an additional advantage, obtaining keypoints and descriptors enables integration with localization and mapping frameworks (e.g., SLAM) widely employed in robotics and automotive applications, paving the way towards the realisation of a complete mosaicking and navigation system for fetal surgery.
Possible limitations of the proposed framework may be encountered during sudden changes in illumination or with highly over- or under-exposed images. In these circumstances, it may not be possible to detect a sufficient number of keypoints to calculate homography robustly. In such a case, the inconsistent homography filter may mitigate the failure of mosaicking only if the changes happen within a few frames. Another possible limitation of this framework is the absence of maternal breathing handling. Although this does not compromise the usability of the generated mosaic, it may introduce some minor distortions.
A limitation of the experimental protocol may be seen in the dataset size, but this is currently the largest available dataset for in-vivo fetoscopy mosaicking.The proposed framework may also be potentially translated to other surgical fields, including neuro microsurgery for the treatment of gangliogliomas.
Our experimental results suggest that the proposed framework may be effective in supporting surgeons during surgical procedures for treating TTTS by providing a larger FoV.This may have a positive impact, by reducing surgeons' mental workload and, as a consequence, potentially reducing patients' risks and lowering surgery duration.

Ethical standards
The proposed study is a retrospective study. Data used for the analysis were acquired during actual surgical procedures and then anonymized to allow researchers to conduct the study. All patients gave their consent to data processing for research purposes. The study fully respects and promotes the values of freedom, autonomy, integrity and dignity of the person, social solidarity and justice, including fairness of access. The study was carried out in compliance with the principles laid down in the Declaration of Helsinki, in accordance with the Guidelines for Good Clinical Practice.

Declaration of Competing Interest
No benefits in any form have been or will be received from a commercial party related directly or indirectly to the subjects of this manuscript.

Figure 2 :
Figure 2: Overview of our mosaicking framework. The Keypoint Proposal Network (KPN) computes keypoints that are then filtered to reject irrelevant keypoints. Registration for mosaicking is performed to register consecutive fetoscopy frames. Warping and blending are performed for visual purposes.

Figure 3 :
Figure 3: Overview of the improved Keypoint Proposal Network. KH and DH are the keypoint head and the keypoint descriptor head, respectively. Irrelevant keypoint rejection relies on surgical-tool and fetus segmentation performed by the U-Net with ResNet50 backbone from Bano et al. (2020b). The overall output is a set of keypoints and their descriptors.

Figure 5 :
Figure 5: Plot of SSIM_n with n=5 computed over the full video length. The curves refer to Bano et al. (2020a) (red) and the proposed framework (orange).

Figure 6 :
Figure 6: Mosaics obtained from the 6 TTTS videos using the method from Bano et al. (2020a) and the proposed framework. Results refer to the dataset presented in Sec. 3.1.

Table 3 :
Quantitative results for the 6 tested in-vivo fetoscopy videos. The SSIM_n with n = 5 frames is reported as mean ± standard deviation.

Table 4 :
Quantitative results for the extended dataset with longer fetoscopy sequences.