Non-rigid point cloud registration for middle ear diagnostics with endoscopic optical coherence tomography

Purpose Middle ear infection is the most prevalent inflammatory disease, especially among the pediatric population. Current diagnostic methods are subjective and depend on visual cues from an otoscope, which is limited for otologists to identify pathology. To address this shortcoming, endoscopic optical coherence tomography (OCT) provides both morphological and functional in vivo measurements of the middle ear. However, due to the shadow of prior structures, interpretation of OCT images is challenging and time-consuming. To facilitate fast diagnosis and measurement, improvement in the readability of OCT data is achieved by merging morphological knowledge from ex vivo middle ear models with OCT volumetric data, so that OCT applications can be further promoted in daily clinical settings. Methods We propose C2P-Net: a two-staged non-rigid registration pipeline for complete to partial point clouds, which are sampled from ex vivo and in vivo OCT models, respectively. To overcome the lack of labeled training data, a fast and effective generation pipeline in Blender3D is designed to simulate middle ear shapes and extract in vivo noisy and partial point clouds. Results We evaluate the performance of C2P-Net through experiments on both synthetic and real OCT datasets. The results demonstrate that C2P-Net is generalized to unseen middle ear point clouds and capable of handling realistic noise and incompleteness in synthetic and real OCT data. Conclusions In this work, we aim to enable diagnosis of middle ear structures with the assistance of OCT images. We propose C2P-Net: a two-staged non-rigid registration pipeline for point clouds to support the interpretation of in vivo noisy and partial OCT images for the first time. Code is available at: https://gitlab.com/nct_tso_public/c2p-net.


Introduction
The middle ear consists of the tympanic membrane (TM) and an air-filled chamber containing the ossicle chain (OC) that connects the TM to the inner ear.From a functional perspective, the middle ear matches the impedance from air to the fluid-filled inner ear [1].Middle ear disorders contain deformation, discontinuation of TM and OC, effusion, and cholesteatoma in the middle ear.These may occur because of middle ear infection (e.g.acute otitis media (AOM), chronic otitis media (COM), otitis media with effusion (OME)) and trauma [2,3].Most commonly, serious and chronic middle ear disorder may lead to conductive hearing loss and inner ear disorder [4].Current middle ear diagnostics methods or tools cover only aspects of the pathology and cannot determine the origin and site of transmission loss, e.g.otoscopy, audiometry, tympanometry, etc.As an innovative image technology, endoscopic OCT enables the examination of both the morphology and function of the middle ear in-vivo by the non-invasive acquisition of depth-resolved and high-resolution images [5][6][7].Nevertheless, multiple sources of noise, e.g. the shadow of preceding structures, reduce the quality of the target structures further away from the endoscopic probe, e.g.ossicles.Therefore, the reconstructed 3D models from the OCT volumetric data are usually noisy and difficult to interpret, especially the identification of the deeper middle ear structures such as incus and stapes (see Fig. 1).
Our aim is to improve the interpretation of the OCT volumetric data by merging the ex-vivo middle ear model reconstructed from micro-Computed Tomography (µCT) scan of isolated temporal bones and the in-vivo OCT model.Note that a single µCT model was used as a template and fitted to all patient-specific OCT scans.We first convert all ex-and in-vivo middle ear data to point cloud representations for ease of flexible manipulation of middle ear shapes.Nevertheless, finding one-to-one correspondence between such a point cloud pair is still challenging due to the noise and incompleteness of the OCT model, and the difference between the patient data and the template µCT data.To tackle these issues, we first employ a neural network that searches the sparse correspondences for the source and target points, then a pyramid deformation algorithm to fit the points in a non-rigid fashion based on the predicted correspondences.The neural network is trained on randomly generated synthetic shape variants of the middle ear that contain random noise and only partial points.This enables the generalization of the neural network to new patient data as well as the adaption to noise and outliers The main contributions of this work are listed below: 1.A generation pipeline in Blender3D that simulates synthetic shape variants from a complete ex-vivo middle ear model, and simulates noisy and partial point clouds comparable to in-vivo data.2. C2P-Net: a two-staged non-rigid registration pipeline for point clouds.It demonstrates that C2P-Net registers the complete point cloud template to partial point clouds of the middle ear effectively and robustly.

Related Work
In this section, we review methods that analyze 3D point clouds and solve registration problems.Learning the geometry of 3D point clouds is the foundation of diverse 3D applications, but is also a challenging task due to the irregularity and asymmetry of point clouds.Recently, approaches that focus on learning point features with neural networks have been widely investigated to overcome such issues.One category of methods is to project the point cloud to a regular representation, e.g.voxel grids [8], 2D images from multi-views [9,10], or combined [11] that 2/3D convolutional operations can be performed on top of these intermediate data in Euclidean space.Apart from these, PointNet [12] leverages multilayer perceptron (MLP) to obtain pointwise features and a max-pool to aggregate global features but without local information.This drawback is alleviated by PointNet++ [13], which uses PointNet to aggregate local features in a multi-scale structure.In addition, various explicit convolution kernels are applied directly on the point clouds to extract encodings [14][15][16] in which point features are extracted based on the kernel weights.Conventional methods solve the point cloud registration task as an optimization problem of transformation parameters.Iterative Closest Point (ICP) [17] iteratively calculates a rigid transformation matrix based on updated correspondence set from a last iteration, Non-rigid Iterative Closest Point (NICP) [18] achieves non-rigid registration by optimizing an energy function including local regularisation and stiffness.Coherent Point Drift (CPD) maximizes the probability of Gaussian Mixture Model (GMM) by moving coherent GMM centroids from the two point clouds.Recently, deep learningbased methods using the aforementioned point cloud encoding techniques have been widely investigated [19][20][21][22][23]. PointNetLK [19] extracts semantic point cloud features using MLP, then solves rigid alignment using the Lucas and Kanade algorithm [24].PR-Net [20] follows a similar way of feature extraction while it maps the feature to regular grids and defines the registration task as a correlation problem.RegTR [21] and NgeNet [22] both introduce attention mechanisms to enable the communication of the extracted point cloud features from the source and target and estimate a global transformation from the predicted correspondences with MLPs.In contrast to this direct way, NDP [23] decomposes the global transformation into sub-motions which are iteratively refined using MLPs at each pyramid level.Inspired by this recent work regarding point cloud analysis and registration, we adopted NgeNet [22] and NDP [23] as the main components to construct our registration pipeline.

Methods
The non-rigid registration of the ex-and in-vivo middle ear point clouds (see Fig. 1) using neural networks is realized in a supervised learning fashion.Due to lack of real OCT data with ground truth, and the difficulties of finding one-toone correspondences between the general and ex-vivo model and the noisy and partial in-vivo model, we generated synthetic in-vivo middle ear point clouds based on the ex-vivo µCT model using a two-step simulation pipeline.For point cloud registration, we propose C2P-Net, which is a two-stage registration method.We first train a neural network on the synthetic samples to explore the correspondence along with an initial rigid transformation matrix, then based on this, a hierarchical algorithm is performed to estimate the non-rigid deformation.

Synthetic Data Generation
We start with a complete middle ear model M µCT reconstructed from µCT volume as basis for the simulation in Blender3D.The non-rigid simulation is performed with the assistance of lattice modifiers attached to each structure with reasonable resolution.The lattice vertices are assigned to various groups according to length, thickness, and width of the global and local shape of a structure.By transforming and rotating various vertice groups with random parameters with boundary conditions, a random shape of a structure can be generated, e.g. a relatively shorter malleus with a longer lateral process.The non-rigid simulation step is formulated as Tnr = S nr (T, p nr ), where T is the template middle ear, p nr are the input parameters, and Tnr is the simulated non-rigid shape variants of T .Next, armature modifiers are attached to each structure and connected in an end-to-end fashion at the articulations.Then, rigid simulation is accomplished by transforming and rotating the armature bones.We denote the rigid simulation as S r .Thus, a random shape variant of the middle ear T can be obtained which is considered as the in-vivo ear model of a new patient: T = S r ( Tnr , p r ).
In practice, the in-vivo OCT model is usually noisy and only partially visible, which limits the registration methods in finding accurate and robust correspondences.To simulate the real data, we extract random partial patches for each structure of the shape variant T by a combined probability, which is calculated as the distance of each vertex to the support point of each structure and random Gaussian noise.Additionally, for the sake of simulating incomplete posterior structure, e.g.stapes, caused by the occlusion and shadow from the anterior structures, e.g.ear canal wall, we determine the number of points of the posterior structures by the distance to the external ear and a random factor.Furthermore, uniform random displacements are applied to the vertices to generate the final noisy and partial in-vivo point cloud P inv (see Fig. 1).

C2P-Net
We delineate the architecture of C2P-Net in Fig. 2. It consists of two components dedicated to two stages: initial rigid registration and pyramid non-rigid registration.Given an ex-vivo point cloud as a template extracted from a µCT model: P exv = {x i ∈ R 3 } i=1,2,...,N , and a partial point cloud of the simulated in-vivo shape variant: P inv = {y j ∈ R 3 } j=1,2,...,M , we adopted the Neighborhood-aware Geometric Encoding Network (NgeNet) [22] to solve the initial rigid registration task.This stage is formulated as: τ, σ = NgeNet(P exv , P inv ) where (u, v) ∈ σ where τ ∈ SE(3) is the rigid transformation matrix which aligns P exv with P inv , and (u, v) ∈ σ is the sparse correspondence set where u and v are indices for points in P exv and P inv .Due to the multi-scale structure and a voting mechanism integrating features from different resolutions, NgeNet handles noise well and predicts correspondence robustly.
Based on the previous predicted correspondence set and the rigidly aligned source and target point clouds, we employ the Neural Deformation Pyramid (NDP) [23] to predict the non-rigid deformation of the given point cloud pair.NDP defines the non-rigid registration problem as a hierarchical motion decomposition problem.At each pyramid level, the input points from last level are mapped to sinusoidal encodings with different frequencies: Γ(p k−1 ) = (sin(2 k+k0 p k−1 ), cos(2 k+k0 p k−1 )), k is the current level number, k 0 controls the initial frequency and p k−1 is an output point from the last level.Lower frequencies at shallower levels represent rigid sub-motion, while higher frequencies at deep levels emphasize non-rigid deformations.In this way, a sequence of submotions is estimated from rigid to non-rigid, and the final displacement field is the combination of such a sequence.Formally, we denote the stage as: where Pexv is the source point cloud P exv transformed by τ , n is the number of pyramid layers of the NDP neural network, m is the maximal iteration within a single pyramid layer, and φ est is the predicted displacement field describing how each point should move to the target.Combined losses are calculated at each iteration, including correspondence loss and regularization loss, and backpropagated to update the weights of each MLP.Of which, the correspondence loss L CD is defined as the Chamfer distance (3) between Pexv which is masked by the correspondence σ and P inv .

Evaluation
C2P-Net predicts a displacement field to fit the ex-vivo point cloud template to the target in-vivo point cloud which is noisy and partial.At training time, C2P-Net employs an Adam optimizer and ExpLR scheduler and uses supervised learning to learn on a synthetic dataset with 20,000 in-vivo shape variants of the middle ear.During inference, it takes around 3.5s for C2P-Net to react to a new in-vivo shape on RTX 2070 Super.Since the neural network is trained on synthetic data, it is crucial to investigate the generalization of the neural networks on unseen samples as well as the performance on real OCT data.Thus, we conduct two experiments on synthetic data and real OCT data with various metrics: mean displacement error ( 5), Chamfer distance (3), and landmarks error (6).Furthermore, we compare our results with other popular non-rigid registration methods, e.g.Non-rigid ICP (NICP) and Coherent Point Drift (CPD).
N is the size of the test dataset, φ est is the estimated displacement field of a sample, and φ gt is the corresponding ground truth.
In Silico.A synthetic dataset is generated using the pipeline described in section 3.1 with the same boundary conditions, and the target shape variants are unseen to the neural network during training.For each target shape, we estimated the displacement field for the identical template point cloud using C2P-Net trained on 20,000 samples.The mean target displacement field for the synthetic dataset is 1.54mm.Fig. 3a illustrates that the neural network tends to predict the deformation of the ex-vivo point cloud with higher accuracy if it obtains more information about the target in-vivo OCT point cloud.With few exceptions, Fig. 3b shows when the initial registration error made by NgeNet is low, the final global non-rigid displacement field from NDP tends to be small.In Vivo.A real OCT dataset with 9 samples of the human middle ear was collected with a handheld OCT imaging system [5] and segmented manually by surgeons within 3D Slicer along with a set of sparse landmarks for each structure of a sample (see Fig. 1).Since there is no ground truth correspondence available for the source and target point clouds to calculate M MDE , we introduce another metric based on the sparse landmarks: where L exv are the landmarks on the ex-vivo point cloud from µCT, which is also marked manually, L inv are the corresponding landmarks on the target invivo point cloud, and the landmark points belong to K different segmentations.
Tab. 1 itemizes the registration results of C2P-Net and baseline models on both datasets.We can observe our registration pipeline outperforms the other methods in terms of M MDE and M L .However, baseline models tend to have smaller M CD , which means the source point clouds are deformed largely to fit the target regardless of the anatomical geometry of the middle ear.Furthermore, this issue is demonstrated visually in Fig. 4, which depicts several registration results from C2P-Net and baseline models on real OCT samples.Hence, it is manifested that C2P-Net is able to register the partial target point cloud without losing the underlying geometry of source point clouds, which is the critical information clinicians want to obtain with registration.On the contrary, the baseline models only focus on the spatial position of local points regardless of the neighbouring and global geometry of the middle ear template.

Conclusion
In this work, we propose to improve the interpretation of OCT data for surgical diagnosis by fusing the knowledge from ex-vivo and general µCT data with the in-vivo noisy and incomplete OCT data.To this end, we propose C2P-Net which is based on NgeNet and Neural Deformation Pyramid.It takes the exand in-vivo 3D point clouds as input and explores the sparse correspondence between the two, then aligns them in a non-rigid fashion.Due to the lack of training data, a fast and effective synthetic simulation pipeline was designed using Blender3D, which produces noisy and partial point clouds as seen invivo based on the randomly generated shape variants of the middle ear.Our neural network was trained based on 20,000 synthetic samples, and it took around 3.5s to predict the displacement field for a given in-vivo point cloud at inference time.
To assess the performance of C2P-Net, experiments on synthetic and real OCT datasets were carried out.Since the real OCT data is noisy and incomplete, we investigated various metrics including mean displacement error, landmark error, and Chamfer distance and compared C2P-Net to the other baseline models from both statistics and geometry aspects.From Tab. 1 and Fig. 4, we can see that C2P-Net is able to deal with the partial OCT data and retain the anatomical structure during non-rigid registration, while other non-rigid methods do not understand the global geometry of the point clouds and focus only on local points.
Future work will focus on improving registration accuracy.This can be achieved by improving the simulation pipeline with realistic noise to further bridge the gap between synthetic and real data.Furthermore, the segmentation information can be employed by the network to improve the prediction of correspondences.Apart from this, inference time can be further decreased by exploring the optimal configuration of the network for the OCT samples.

Fig. 1 :
Fig. 1: Overview of data samples.In a), I is the ex-vivo template point cloud reconstructed from µCT scans on the left.II is the segmented middle ear model, each color stands for one structure.III shows the model with sparse labels.Accordingly, I, II and III in b) present in-vivo OCT point cloud, segmented OCT model and labels.The simulation pipeline on the top left takes a complete ex-vivo ear point cloud, performs non-rigid and rigid simulation, and produces synthetic partial and noisy in-vivo point cloud shown in c).

Fig. 2 :
Fig. 2: Schematic architecture of C2P-Net.The input of the pipeline is the ex-vivo (blue) and in-vivo (orange) point cloud.To explore the correspondence between the two inputs, NgeNet is trained on synthetic samples to extract the features for each point at different levels.These are integrated by a voting algorithm.A feature matching routine is performed to obtain the sparse correspondence set as well as a rigid transformation matrix.On top of the previous outputs, NDP maps the input point clouds to sinusoidal space and decomposes the global non-rigid deformation at each pyramid level into submotions.With the increase of sinusoidal frequency, the non-rigid deformation of each point is continuously refined at each level.The registered point cloud is shown on the right side.

Fig. 3 :
Fig.3: a) shows the relation between the MDE of synthetic samples and the visible points ratio, which is calculated as the ratio between the point number of a target point cloud to the corresponding ground truth point cloud:|P inv |/| T |.The decreasing trend shows that with more points available in the target in-vivo point clouds, the neural network tends to predict better deformation.b) depicts the MDE to the initial rigid registration error which is produced by the NgeNet.The better the initial rigid registration and correspondence set predicted by NgeNet, the lower the global non-rigid registration error from NDP becomes.

Fig. 4 :
Fig. 4: Registration results of baseline models and C2P-Net on synthetic (SYN) and real OCT samples (REAL).Blue point clouds are the deformed ex-vivo ear models, i.e. the shape template, calculated by all the investigated methods.The orange point clouds stand for the registration target, i.e. in-vivo point clouds.From the visualization, we can see baseline non-rigid methods, NICP and CPD, tend to squeeze the points from the ex-vivo template largely to match the target point clouds, since they only focus on position of local points, while C2P-Net retains the global geometry of the middle ear due to the learned correspondence knowledge between the ex-and in-vivo point clouds.

Table 1 :
Experiments results on synthetic data (SYN) and real OCT data (REAL), including mean displacement error (M MDE )[mm], landmarks error (M L )[mm] and Chamfer distance (M CD )[mm] rigid registration method(s).† baseline methods tend to fall into local minima and ignore the geometry, though the CD lower. *