Surgical biomicroscopy-guided intra-operative optical coherence tomography (iOCT) image super-resolution

Purpose Intra-retinal delivery of novel sight-restoring therapies will require the precision of robotic systems accompanied by excellent visualisation of retinal layers. Intra-operative Optical Coherence Tomography (iOCT) provides cross-sectional retinal images in real time, but at an image quality that is insufficient for intra-retinal therapy delivery. This paper proposes a super-resolution methodology that improves iOCT image quality by leveraging the spatiotemporal consistency of incoming iOCT video streams.
Methods To overcome the absence of ground truth high-resolution (HR) images, we first generate HR iOCT images by fusing spatially aligned iOCT video frames. Then, we automatically assess the quality of the HR images on key retinal layers using a deep semantic segmentation model. Finally, we use image-to-image translation models (Pix2Pix and CycleGAN) to enhance the quality of low-resolution (LR) images via quality transfer from the estimated HR domain.
Results Our proposed methodology generates iOCT images of improved quality according to both full-reference and no-reference metrics. A qualitative study with expert clinicians also confirms the improvement in the delineation of pertinent layers and in the reduction of artefacts. Furthermore, our approach outperforms conventional denoising filters and the learning-based state-of-the-art.
Conclusions The results indicate that learning-based methods trained on the HR domain estimated by our pipeline can enhance iOCT image quality. The proposed method can therefore computationally augment the capabilities of iOCT imaging, helping this modality support the vitreoretinal surgical interventions of the future.


Introduction
Regenerative therapies (e.g. [1,2]) are emerging as treatments for blinding retinal diseases such as Age-Related Macular Degeneration [3]. Their efficacy, however, will depend on their precise injection into the sub-retinal and intra-retinal space. High-resolution cross-sectional images (B-scans) of the retina are required so that the retinal layers of interest can be visualised with quality adequate for injection guidance. Optical Coherence Tomography (OCT) captures such cross-sectional retinal images.
Intra-operative OCT (iOCT), acquired through recently introduced modified biomicroscopy systems such as Zeiss OPMI/Lumera and Leica Proveo/Enfocus, can be delivered in real time but at the expense of image quality (low signal strength and increased speckle noise [4]) compared with pre-operative OCT. The resulting iOCT scans are ambiguous and of limited interventional utility. While complementary research develops higher-quality iOCT systems, e.g. [5], we focus on computationally enhancing the capabilities of already deployed clinical systems.
An established approach to OCT quality enhancement is denoising. Spatially adaptive wavelets [6], Wiener filters [4], diffusion-based [7] and registration-based techniques [8] reduce speckle noise while preserving edges and image features. Unfortunately, these methods suffer from prolonged scanning periods, alignment errors and high computational cost, which limit their effectiveness for real-time interventions and iOCT.
Despite its superior quality, pre-operative OCT is acquired under different protocols (date, patient position, device) than iOCT, implying a domain gap in addition to deformations that may lead to generated images with artefacts. Therefore, our paper considers iOCT information only. We propose a methodology that uses high-resolution (HR) iOCT images generated offline by registering and fusing low-resolution (LR) iOCT video frames (B-scans). Generated HR images are ranked for quality using metrics that incorporate the quality of segmented retinal layers. High-scoring HR images comprise the target domain for image-to-image translation. Several image quality metrics and a complementary qualitative survey show that our super-resolution methodology improves iOCT image quality, outperforming filter-based denoising methods and the learning-based state-of-the-art [19].

Methods
This section presents the process of creating HR iOCT images, validating their quality and generating super-resolved (SR) iOCT images through image-to-image translation.

Data
Our data are derived from an internal Moorfields Eye Hospital database of vitreoretinal surgery videos, including intra-operative and pre-operative OCTs. We use a data-complete subset comprising 42 intra-operative retinal surgery videos acquired from 22 subjects. The data contain the surgical microscope view captured by a Zeiss OPMI LUMERA 700 with embedded LR iOCT frames (resolution of 440x300) acquired by RESCAN 700 (see Fig. 1).

HR iOCT generation
HR iOCT images are generated by registering iOCT video frames acquired at the same retinal position and fusing them through temporal averaging. This process is illustrated in Figs. 2 and 3.
First, for each surgery video (Fig. 1a), we identified the time intervals in which both the iOCT scan position and the positions of tracked retinal points remain constant. During such intervals, the acquired iOCT B-scans can be considered as corresponding to the same retinal location and can therefore be registered and fused to obtain an HR B-scan.
The position of the iOCT scan is obtained by detecting the white square depicted on the surgical microscope view (see Fig. 1a), which illustrates the iOCT's scanning region. Detection starts with binary thresholding, Canny edge detection and Hough line transform on the microscope view image. To improve the robustness of identifying the iOCT scan position, we further detected the cyan and magenta arrows inside the already detected square (see Fig. 2). Two points (one point per arrow) were derived to represent the iOCT scan position.
Because the retina moves (patient breathing, surgical interactions), we must verify that it is also stationary. Therefore, we manually selected a point at the start of each video sequence that corresponds to a strong feature (e.g. a vessel bifurcation) and tracked it using the Lucas-Kanade method.
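The search for intervals of constant position can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that the scan position points and the tracked retinal point are already available per frame as pixel coordinates, and the function name, the tolerance `tol` and the parameter `min_len` are hypothetical. `min_len` defaults to 9 to mirror the criterion of more than eight consecutive frames used in this work.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def stable_intervals(
    positions: Sequence[Sequence[Point]],
    tol: float = 1.0,      # assumed pixel tolerance for "constant" position
    min_len: int = 9,      # more than eight consecutive frames
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, end exclusive, of runs in which every
    tracked point stays within `tol` pixels of its position at the run start."""
    intervals = []
    start = 0
    for i in range(1, len(positions) + 1):
        # A run ends at the end of the video or when any point has moved.
        moved = i == len(positions) or any(
            abs(p[0] - q[0]) > tol or abs(p[1] - q[1]) > tol
            for p, q in zip(positions[i], positions[start])
        )
        if moved:
            if i - start >= min_len:
                intervals.append((start, i))
            start = i
    return intervals
```

Each frame entry holds all points that must remain still (for example the two arrow points and the retinal feature), so a single moving point is enough to close the current interval.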
If the aforementioned positions remained constant for more than eight consecutive video frames (number empirically selected), the corresponding iOCT B-scans were rigidly registered to the first B-scan and averaged to generate the corresponding HR iOCT frame (Fig. 3). We applied rigid registration to avoid the unrealistic deformations (e.g. folding) that non-rigid registration might introduce.
As the videos depict actual surgical procedures, many incoming LR iOCT images have low signal strength, calculated as signal-to-noise ratio (SNR) [20]. Their corresponding fused HR images will therefore be of low SNR as well. Furthermore, imperfections in tracking retinal points and registration errors between the LR iOCT images could lead to blurry averaged HR iOCT scans. These factors degrade the quality of many HR images, which in turn lessens the robustness of the estimated HR domain in terms of SNR and contrast.
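The registration-and-averaging step can be illustrated with a short NumPy sketch. It is a simplification under stated assumptions: the rigid registration is replaced here by translation-only alignment estimated with phase correlation, and the function names `estimate_shift` and `fuse_frames` are hypothetical. The registration method actually used may also handle rotation.

```python
import numpy as np


def estimate_shift(ref: np.ndarray, mov: np.ndarray) -> tuple:
    """Integer (dy, dx) such that np.roll(mov, (dy, dx), axis=(0, 1)) aligns
    with ref, estimated by phase correlation."""
    cross = np.fft.fft2(ref) * np.conj(np.fft.fft2(mov))
    cross /= np.abs(cross) + 1e-12          # keep phase only
    corr = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = ref.shape
    # Map peak indices in the upper half of the range to negative shifts.
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)


def fuse_frames(frames) -> np.ndarray:
    """Align every B-scan to the first one and average them."""
    ref = frames[0].astype(np.float64)
    acc = ref.copy()
    for frame in frames[1:]:
        mov = frame.astype(np.float64)
        dy, dx = estimate_shift(ref, mov)
        # np.roll wraps around at the borders; acceptable for small shifts
        # in a sketch, but real data would need cropping or padding.
        acc += np.roll(mov, (dy, dx), axis=(0, 1))
    return acc / len(frames)
```

Averaging N aligned frames with independent noise reduces the noise standard deviation by roughly a factor of sqrt(N), which is why misregistration, rather than noise, becomes the main source of blur in the fused image.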
To assess the quality of the generated images and decide which ones should be included in the HR dataset, we used three metrics: SNR, Equivalent Number of Looks (ENL) and Contrast to Noise Ratio (CNR) [4]. In these metrics, $F_{lin}$ is the linear magnitude image, $\sigma_{lin}$ is the standard deviation of $F_{lin}$ in a background noise region, and $\mu_b$, $\mu_h$, $\mu_r$ and $\sigma_b$, $\sigma_h$, $\sigma_r$ are the means and standard deviations of the background region (b), the homogeneous regions (h) and all regions of interest (r), respectively. In our image quality assessment, we empirically used H = 2 and R = 4 (see Fig. 3). To obtain metrics describing image quality on key anatomical landmarks, namely retinal layers, we compute retinal layer masks using a deep semantic segmentation model. The metrics are then computed on regions of interest (ROIs) tightly cropped around the retinal layers.
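The metric equations are not reproduced in this text. Assuming that they follow the definitions commonly used in the OCT speckle-reduction literature, with the symbols defined above, they would read as below. This is a reconstruction under that assumption, not a quotation; in particular, some works define CNR with a logarithm applied to each term.

```latex
\begin{align}
\mathrm{SNR} &= 10 \log_{10}\!\left( \frac{\max\left(F_{lin}\right)^{2}}{\sigma_{lin}^{2}} \right) \\
\mathrm{ENL} &= \frac{1}{H} \sum_{h=1}^{H} \frac{\mu_{h}^{2}}{\sigma_{h}^{2}} \\
\mathrm{CNR} &= \frac{1}{R} \sum_{r=1}^{R} \frac{\mu_{r} - \mu_{b}}{\sqrt{\sigma_{r}^{2} + \sigma_{b}^{2}}}
\end{align}
```

Under these definitions, SNR is expressed in decibels, ENL grows as the homogeneous regions become smoother, and CNR grows as the regions of interest separate from the background.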

Retinal layer segmentation
The segmentation model uses the architecture introduced in [21] and is trained using the Lovász-Softmax loss [22]. Due to the lack of large public datasets with pixel-level annotations, we first pretrained the model for retinal fluid segmentation on the RETOUCH dataset, which contains 3200 images (72 subjects). The model was then fine-tuned for retinal layer segmentation on the DUKE dataset, which comprises 610 images (10 subjects). We qualitatively observed acceptable generalization of the segmentation model to our intra-operative OCT dataset. Our aim is not a perfect segmentation of retinal layers but an acceptable approximation of the background area and the pertinent retinal layers in the iOCT image, from which ROIs can be extracted for the calculation of SNR, CNR and ENL. Given the output label maps of the segmentation model, five ROIs are chosen (see Fig. 3): a background ROI (red rectangle), two small homogeneous ROIs on the second and the last retinal layers (blue rectangles), and two large ROIs on the first and the last retinal layers (green rectangles). The centre of each ROI is placed randomly in the B-scan, subject to these location constraints, which stem from the requirements of the quality metrics themselves. Applying SNR, CNR and ENL to these ROIs with empirically defined thresholds of 70.0, 3.0 and 10.0, respectively, we identified 962 HR images of acceptable quality to form the HR dataset.
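The ROI-based acceptance test can be sketched as follows. The sketch assumes the commonly used forms of SNR, ENL and CNR, assumes that an image is accepted when every metric is at or above its threshold, and uses hypothetical function names; only the threshold values (70.0, 3.0 and 10.0 for SNR, CNR and ENL) come from the text.

```python
import numpy as np


def snr_db(image: np.ndarray, background: np.ndarray) -> float:
    """SNR in dB: squared image maximum over background noise variance."""
    return float(10.0 * np.log10(image.max() ** 2 / background.var()))


def enl(homogeneous_rois) -> float:
    """Equivalent Number of Looks averaged over the homogeneous ROIs."""
    return float(np.mean([roi.mean() ** 2 / roi.var() for roi in homogeneous_rois]))


def cnr(rois, background: np.ndarray) -> float:
    """Contrast to Noise Ratio averaged over all ROIs, relative to background."""
    mu_b, var_b = background.mean(), background.var()
    return float(np.mean([(roi.mean() - mu_b) / np.sqrt(roi.var() + var_b) for roi in rois]))


def accept_hr(image, background, homogeneous_rois, rois,
              snr_min=70.0, cnr_min=3.0, enl_min=10.0) -> bool:
    """Assumed rule: keep an HR candidate only if all three metrics pass."""
    return bool(
        snr_db(image, background) >= snr_min
        and cnr(rois, background) >= cnr_min
        and enl(homogeneous_rois) >= enl_min
    )
```

Here `homogeneous_rois` would hold the two small ROIs and `rois` all four layer ROIs, each passed as a cropped array taken from the fused B-scan.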

Deep learning models
To perform super-resolution (SR), we used two state-of-the-art image-to-image translation models: CycleGAN [11] and Pix2Pix [12]. These models belong to the family of GANs, which alternately train a generator G and a discriminator D in an adversarial manner. Pix2Pix requires supervision in the form of aligned image pairs to update its generator G, as it minimizes an L1 loss between the image generated from the source (LR) domain and the paired image of the target (HR) domain. In contrast, CycleGAN can be trained without paired examples, using cycle consistency to enforce mappings in the forward (LR → HR) and backward (HR → LR) directions. Preliminary experiments, however, revealed that CycleGAN produced inconsistent results on unpaired images. We therefore also include L1 supervised losses when training CycleGAN.
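The reconstruction terms discussed above can be made concrete with a small NumPy sketch. It is illustrative only: the generators are passed in as plain callables, the adversarial terms and the loss weights are omitted, and applying the supervised L1 term in both directions is an assumption, since the text does not specify it. The function names are hypothetical.

```python
import numpy as np


def l1(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute difference between two images."""
    return float(np.mean(np.abs(a - b)))


def supervised_cycle_losses(lr, hr, g_fwd, g_bwd) -> dict:
    """Cycle-consistency and paired L1 terms for one aligned (LR, HR) pair.

    g_fwd maps LR -> HR and g_bwd maps HR -> LR.
    """
    fake_hr = g_fwd(lr)
    fake_lr = g_bwd(hr)
    # Cycle consistency: translating there and back should return the input.
    cycle = l1(g_bwd(fake_hr), lr) + l1(g_fwd(fake_lr), hr)
    # Supervised term: each translation should match its paired target.
    paired = l1(fake_hr, hr) + l1(fake_lr, lr)
    return {"cycle": cycle, "paired_l1": paired}
```

The example shows why the supervised term adds information: two generators that are exact inverses of each other drive the cycle term to zero even when their outputs do not match the paired targets.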

Implementation details
The dataset (962 image pairs of LR and HR iOCT images) was split into three subsets: training set (70%), validation set (10%) and test set (20%). We performed online data augmentation for the training set through rotation (±5°), translation (±30 width, ±20 height), horizontal flip (with a probability of 0.5), scale (1 ± 0.2) and the Albumentations 'colorjitter' augmentation with brightness and contrast between [2/3, 3/2]. Our implementations of Pix2Pix and CycleGAN are based on publicly available code, and both networks use CycleGAN's ResNet-based generator [10] with 9 residual blocks. Our networks are trained using the Adam optimizer for 200 epochs, with a batch size of 4 and an input resolution of 440x300 for both Pix2Pix and CycleGAN. Our experiments ran on an NVIDIA Quadro P6000 GPU with 24 GB memory.
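The 70/10/20 split of the 962 image pairs can be sketched as below. The sketch assumes a random split at the image-pair level with a fixed seed; the text does not state whether the split was performed per image, per video or per subject, and the function name is hypothetical.

```python
import numpy as np


def split_indices(n: int, fractions=(0.7, 0.1, 0.2), seed: int = 0):
    """Shuffle indices 0..n-1 and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    # The test set takes the remainder so that every index is used once.
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(962)
print(len(train_idx), len(val_idx), len(test_idx))  # 673 96 193
```

Since B-scans from the same video are strongly correlated, grouping the indices by video or subject before splitting would be one way to reduce leakage between subsets.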

Results
This section presents the results of the quantitative and qualitative analysis that we performed to validate our SR pipeline. We also validate the merit of employing deep learning for this task by comparing our models with classical filter-based OCT denoising techniques and the learning-based state-of-the-art.

Quantitative analysis
We quantitatively validate the quality enhancement of the SR images compared to the LR iOCT images. As our ground truth (HR) images are estimated by our methodology, full-reference metrics alone are not sufficient for image quality evaluation. Therefore, our analysis uses six different metrics: two full-reference metrics, i.e. Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM), and four no-reference metrics, i.e. the perceptual loss function ($\ell_{feat}$) [10], Frechet Inception Distance (FID) [23], Global Contrast Factor (GCF) [24] and Natural Image Quality Evaluator (NIQE) [25]. The metric values were calculated on the test images for LR iOCT, SR using the state-of-the-art method of [19], SR using Pix2Pix [12] (SR-Pix) and SR using CycleGAN [11] (SR-Cyc). The evaluation metrics were computed at the original resolution (440x300 px) for both Pix2Pix and CycleGAN outputs. The results are reported in Table 1. We assessed the statistical significance of the pairwise comparisons using a paired t-test. All p-values were p < 0.001, except for pairwise comparisons between SR-Cyc and filter-based methods for SSIM.
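For illustration, PSNR and the statistic behind the paired t-test can be computed as follows. This is a minimal sketch assuming 8-bit intensity range (`max_value=255`), which the text does not state; the function names are hypothetical, and the p-value would additionally require the t distribution with n - 1 degrees of freedom.

```python
import math

import numpy as np


def psnr(reference: np.ndarray, test: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a reference and a test image."""
    diff = reference.astype(np.float64) - test.astype(np.float64)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def paired_t_statistic(a, b) -> float:
    """t statistic of a paired t-test on per-image metric values a and b."""
    d = np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)
    return float(d.mean() / (d.std(ddof=1) / math.sqrt(d.size)))
```

In this setting, `a` and `b` would hold the same metric evaluated on the same test images for two competing methods, so that each difference isolates the effect of the method.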
Full-reference metrics (PSNR, SSIM) were calculated using the HR images as reference. Among the no-reference metrics, the perceptual loss $\ell_{feat}$ measures the high-level perceptual similarity between two image domains as the distance between their extracted feature representations.
According to Table 1, SR-Cyc ranks first in terms of PSNR, SSIM, $\ell_{feat}$ and FID, which shows that the image quality has been improved and is perceptually more similar to HR (see also Fig. 4). Regarding GCF, the noisier images (LR and the SR output of [19]) exhibit higher values, probably due to the presence of high-frequency information (speckle noise). Finally, for frames of size 440x300, SR-Cyc runs at 18.17 frames per second (FPS) and SR-Pix at 17.51 FPS, both of which are adequate for real-time iOCT requirements.

Qualitative analysis
To further validate our super-resolution pipeline, we performed a qualitative analysis. Our survey included 20 pairs of LR and SR-Cyc images, randomly selected from the test set. We asked 8 retinal doctors/surgeons to evaluate these image pairs by assigning a score between 1 (strongly disagree) and 5 (strongly agree) to three questions (Q1-Q3). Their answers, A1, A2 and A3 (mean ± standard deviation), indicate that SR-Cyc images provide improved delineation of RPE vs IS/OS junction (Q1), reduction of artefacts (Q2) and improved delineation of ILM vs RNFL (Q3). Visual results are shown in Fig. 4, confirming the findings of our survey.

Denoising results
To demonstrate the denoising effect of our work, as part of the broader aim of image quality enhancement, we compare our best-performing network according to the metrics (SR-Cyc) with conventional denoising filters. We selected three state-of-the-art speckle reduction methods for OCT images: Symmetric Nearest Neighbour (SNN) [27], adaptive Wiener filter [28] and BM3D [29], whose denoising ability has been assessed in several works [4,18].
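As an illustration of the simplest of these filters, a Symmetric Nearest Neighbour filter can be sketched in NumPy as below. This is one common variant and not necessarily the implementation evaluated here: for every pair of neighbours placed symmetrically around a pixel, it keeps the one closer in intensity to the centre and averages the kept values, excluding the centre itself. The window radius and the reflective border handling are assumptions.

```python
import numpy as np


def snn_filter(image: np.ndarray, radius: int = 1) -> np.ndarray:
    """Symmetric Nearest Neighbour filter over a (2*radius+1)^2 window."""
    img = image.astype(np.float64)
    padded = np.pad(img, radius, mode="reflect")
    h, w = img.shape
    total = np.zeros_like(img)
    count = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Visit each symmetric pair once: skip the centre and one half
            # of the window.
            if dy < 0 or (dy == 0 and dx <= 0):
                continue
            a = padded[radius + dy:radius + dy + h, radius + dx:radius + dx + w]
            b = padded[radius - dy:radius - dy + h, radius - dx:radius - dx + w]
            # Keep, per pixel, the neighbour closer in value to the centre.
            total += np.where(np.abs(a - img) <= np.abs(b - img), a, b)
            count += 1
    return total / count
```

Because each pair contributes the neighbour lying on the same side of an edge as the centre pixel, an ideal step edge passes through unchanged while values within a homogeneous region are averaged.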
All the filter-based methods demonstrated considerable denoising capabilities, as shown in Fig. 5. We can, however, observe that those filters blurred the images (b, c, d) and that retinal layers cannot be distinguished easily, especially when compared to the outputs of SR-Pix and SR-Cyc. The SR-Cyc images, in particular, are visually more similar to the HR images.
Quantitative analysis using the aforementioned metrics (see Table 1) shows that SR-Cyc achieved the best performance according to all metrics compared to the Wiener, BM3D and SNN filters. Among the filter-based techniques, SNN has the best performance according to PSNR, SSIM, $\ell_{feat}$ and FID.

Discussion and conclusions
This paper addresses the challenge of super-resolution in iOCT images. We overcome the absence of ground truth HR images by a novel pipeline that leverages the spatiotemporal consistency of incoming iOCT B-scans to estimate the HR images. Furthermore, we automatically assess the quality of the HR images to accept only the high-scoring ones as the target domain for super-resolution. Our quantitative and qualitative analysis demonstrated that the proposed super-resolution pipeline can achieve convincing results for iOCT image quality enhancement and outperform filter-based denoising methods with statistical significance. Future work will increase the sharpness of retinal layer delineations to produce iOCT images of quality even closer to pre-operative OCT scans.