Delaunay walk for fast nearest neighbor: accelerating correspondence matching for ICP

Point set registration algorithms such as Iterative Closest Point (ICP) are commonly utilized in time-constrained environments like robotics. Finding the nearest neighbor of a point in a reference 3D point set is a common operation in ICP and frequently consumes at least 90% of the computation time. We introduce a novel approach to performing the distance-based nearest neighbor step based on Delaunay triangulation. This greedy algorithm finds the nearest neighbor of a query point by traversing the edges of the Delaunay triangulation created from a reference 3D point set. Our work integrates the Delaunay traversal into the correspondences search of ICP and exploits the iterative aspect of ICP by caching previous correspondences to expedite each iteration. An algorithmic analysis and comparison is conducted showing an order of magnitude speedup for both serial and vector processor implementation.


Introduction
Point set registration presents a challenging problem for numerous applications including computer vision, pattern recognition, robotics, and image processing [42]. Registration finds a rigid transformation between two or more data sets-providing a rotation and translation that minimizes the pairwise Euclidean difference between corresponding pairs of matched points. These point clouds can be known a priori, generated from images, light detection and ranging (Lidar), or other sensors.
A common point registration method, Iterative Closest Point (ICP), typically must execute in near real-time. In selfdriving cars, on-board computers utilize the transformation found by ICP to detect objects within the car's local environment. The system may determine an adjacent vehicle is veering into its lane, creating a potential hazard and requiring immediate action. Thus, these applications require point-set registration processes be completed in a timely fashion. After investigating the steps of ICP, when utilizing a k-d tree, the B James D. Anderson anderson.10@wright.edu 1 Wright State University, Dayton, OH, USA 2 Air Force Institute of Technology, Dayton, OH, USA nearest neighbor step consumes the majority of the time, as shown in Fig. 2. Our work has shown a nearly 90% speedup when compared to the traditional k-d tree as seen in Fig. 1. For this reason, our research focuses on the Euclidean distancebased pairwise nearest neighbor matching employed by point set registration methods such as Besl's ICP algorithm [10]. We invite the reader to watch a video overview of the work detailed in this paper at [5].
Nearest neighbor matching finds a candidate point in a reference point cloud that is nearest to a corresponding point in the sensed point cloud [29]. Doing this for each point in the sensed point cloud creates a mapping that associates each point in the sensed cloud to a point in the reference cloud. Naïvely, if both point sets are size n, nearest neighbor matching is an O(n 2 ) algorithm; therefore, accelerating an algorithm such as ICP requires minimizing the nearest neighbor matching time.
This research focuses on accelerating Euclidean distancebased pairwise point matching and is applicable to all registration algorithms utilizing nearest neighbor pairs. This paper makes the following contributions:  The paper is arranged as follows. Section 2 presents related work. Section 3 introduces the Delaunay Traversal nearest neighbor algorithm. Section 4 describes variants of our proposed Delaunay Walk. Section 5 presents the experimental methodology and results. Sections 6 and 7 conclude the paper and discuss future work.

Point set registration
Paul Besl's ICP algorithm [10] is a ubiquitous point set registration process which depends on pairwise Euclidean distance correspondence. From the pair-wise correspondences, ICP registers two point clouds by computing a rotation R and translation t. This is a rigid transformation, which minimizes the mean square point matching error (MSPME) between the two point clouds. This error is the average Euclidean distance between the point correspondences after the transformation is applied.
where n = |X | and m = |P|. Additionally, each point p i corresponds to point x i with the same index. ICP executes to fit the source P onto the target X . From [42], ICP can be expressed as: arg min ICP is broken into 3 major steps: 1. nearest neighbor search 2. transformation estimation 3. transformation validation These three steps are repeated until either the error falls below a set threshold or the error between iterations is negligible. Several variants exist for each of these steps to improve accuracy and speed, particularly transformation estimation. Where Besl utilized point-to-point correspondences, Chen modified the algorithm by generating point normals for the target 3D point set and matching the query point to the plane defined by these normals, typically cited as pointto-plane [13]. Since this method registers source points to the area around the target point instead of the target point itself, the algorithm's sensitivity to noise can be reduced [8]. Point-to-plane's transformation calculation takes longer per iteration, but the entire algorithm typically converges in fewer iterations and results in a similar timing to point-to-point [25]. Extending point-to-plane, generalized ICP describes a plane-to-plane method [38]. In plane-to-plane, normal vectors are calculated on both target and source point clouds. This method can be shown to be more robust; however, since surface normals need to be calculated, plane-to-plane is typically not utilized for real-time ICP applications.
ICP handles rigid point registration with little to no scaling or shearing. Other non-rigid point registrations exist to determine an affine transformation when scaling or shearing are present. Several of these approaches utilize distancebased correspondences. For example, coherent point drift (CPD) [31] utilizes the centroids from Gaussian mixture model (GMM) of the source dataset and find an affine transformation alining those centroids to the expected data. The Delaunay-based nearest neighbor algorithm presented in this paper can be utilized in non-rigid registrations so long as the correspondences are found with a distance-based metric.

Nearest neighbor
Nearest neighbor applications span numerous fields, including pattern recognition [14], machine learning [40], and robotics [35]. Some of these require an ordered list of similar data, while others only require a single closest point. Depending on the metric used for similarity, the closest point may or may not be unique. In pairwise nearest neighbor, some distance metric such as Euclidean distance, Manhattan distance, squared distance, or other calculation is typically utilized depending on the application. This research employs pairwise Euclidean distance nearest neighbor matching.
Brute force is the generic method for determining nearest neighbors. Again, consider 3D point sets X and P. The nearest neighbor of p 0 ∈ P is arg min To match all points P to X , Eq. 2 must be evaluated for every point inside P. Again taking n = ||X || and m = ||P||, a brute force approach results in a time complexity of O(n) to match a single point and O(nm) to match all points.
In order to reduce the time complexity for a nearest neighbor search, applications typically utilize a space-partitioning structure such as a k-d tree. Introduced by Bentley [9], a kd tree is a space partitioning k-dimensional binary tree that organizes a data set by dividing each level by a hyperplane created from the median of the given dimension. A k-d tree is built by splitting space into a set of alternating hyperplanes placed at the median of a first dimension. The points are split along this partition and then recursively divided based on the next dimension's hyperplane. To query the k-d tree for the nearest neighbor to a given point, the algorithm begins at the root node and traverses down the tree, choosing which child node to visit based on which side of the hyperplane the query point exists [17]. After reaching a leaf, the algorithm back-traces up the tree checking the distance to each hyperplane to determine if the adjacent volume needs to be checked for a closer point. Since a binary tree has log 2 (n) levels, the average query time is proportional to O(log(n)) with a worst-case of having to search the entire tree being O(n). However, when numerous queries to the dataset are required, construction of a k-d tree proves a benefit to the alternative of the brute force approach since not every point needs to be compared to the query points.
Other tree-based data structures, including R-trees [19] and octrees [27], have been developed to accelerate nearest neighbor searches. Both structures partition the space of the data comparable to a k-d tree. For R-trees, rectangles cre-ate bounds between clusters of data points with each higher level combining clusters into their minimal spanning rectangle. This approach increases spatial locality of nearby points as rectangles lower in the hierarchy can be stored in nearby memory locations. While the innate structure of the data dictates the regions of an R-tree, an octree partitions the space uniformly. These methods present challenges for beginning at an arbitrary data point since the closet point may not be within the region but in an adjacent one. Searches of tree structures typically begin at the root of the tree. The work presented in this paper allows searches to begin from any arbitrary point within the data set, allowing for extremely fast results when subsequent searches are similar to the previous ones.
One method utilizing octrees described in [16] utilizes previous results to facilitate searches. The tree is constructed on the Voronoi cells computed from the original points rather than the points themselves. Regions of the octree, voxels, contain at most M max intersecting Voronoi cells. These voxels are then accessed through a hash table where each entry is indexed through its level in the tree. The search may need to backtrack up the tree to search adjacent regions. Since the structure of the work presented in this paper does not split the data into regions, utilizing a result for subsequent searches only requires searching the previous result's Delaunay neighbors.
With respect to point registration, methods in [13,38] focus on the accuracy of the second step of Besl's ICP. However, this step of the algorithm executes relatively quickly when compared to the nearest neighbor step. Caching the previous ICP iteration's nearest neighbor correspondences is an approach to accelerate the nearest neighbor phase. A cached k-d tree is presented in [32], where a pointer to the node within the tree is utilized as the starting node for subsequent iterations. Similarly, in [18], if the previous correspondence meets certain geometric constraints, it is used to expedite the nearest neighbor search. This research uses the previous ICP iteration's nearest neighbor correspondence to accelerate a nearest neighbor search using the Delaunay triangulation.
While some applications require an exact nearest neighbor, others only necessitate an approximate match. Typically, an approximate neighbor search returns a result faster than an exact search. A straightforward approach utilizing a k-d tree is to simply traverse from the root to the leaf node containing the query point [18]. Additionally, this method lessens the space complexity required for storing the data structure, as only the leaf nodes and median values for each hyperplane axis need to be stored. In [28], a method is presented that can configure itself to the desired degree of accuracy required for the application, allowing the user to balance between precision and speed. In [34], the k-nearest approximate neighbors are calculated in constant time by utilizing locally sensitive hashing to partition datasets into clusters. While some meth-ods employ an approximate neighbor to accelerate ICP, as shown in Marden [26], the work in this paper only utilizes approximate neighbors as a means to accelerate the exact neighbor search. Thus, the traditional ICP algorithm from Besl can be utilized to verify correctness.

Delaunay triangulation
In 2D, a Delaunay triangulation of a point set is a set of triangles where the vertices are members of the point set and no points are located within the circumcircles created by the vertices of each triangle. When extended to 3D, the triangle simplex generalizes to a tetrahedron simplex, and the circumcircles become circum-spheres. Importantly, the condition of no points being located in each circum-sphere still holds. The concept can be extended to d-dimensions [15]; however, the work presented here focuses on three-dimensions for 3D sensing-based applications. Figure 3 depicts the full Delaunay triangulation of a 3D point set. Figure 4 visualizes a closer view of all the edges connected to a single point (shown in pink) connecting to the neighbor points (shown in blue).
Delaunay triangulations have many applications including path planning [6,23,39], surface reconstruction [4,12,21], and many others. Mulchrone [30] uses Delaunay triangulation to generate nearest neighbors within a data set with the goal of calculating strain. Mulchrone subsequently removes the edges that form the convex hull; the remaining edges con- nect nearest neighbors. In contrast to Mulchrone, the work presented in this paper utilizes all edges of a Delaunay triangulation. Keeping all edges allows the neighbor search to quickly traverse the data set, whereas removing edges from the convex hull limits the distance between vertices within the graph. Similarly, in [36] Delaunay triangulations are utilized to cluster points by removing undesired and redundant edges. While clustering may be utilized in nearest neighbor searches to determine the starting vertex of a traversal, the work in this paper utilizes other methods for determining starting nodes as presented in Sect. 4. These examples show Delaunay triangulation can be used to efficiently assign a variety of point correspondences.
The creation of the Delaunay triangulation structure can be completed offline. Lee [22] describes an iterative, divideand-conquer algorithm to create the Delaunay triangulation structure. In this paper, we will be using the Open3D library [41] to create the Delaunay triangulation offline. Open3D uses the Qhull library [7] to compute the convex hull and stores the 3D triangulation as a series of 3D simplexes, or tetrahedrons.
While these methods identify neighbors within a data set, the Full Delaunay Hierarchies (FDH) [11] algorithm utilizes a Delaunay traversal to determine the nearest neighbor of a source point to a target dataset. The traversal in FDH only moves from vertices of lower index to higher index, disallowing traversal in both directions. This detail increases the complexity of implementing the algorithm. Additionally, the authors state their method cannot directly extend to greater than two dimensions. In contrast, this research focuses on traversal in both directions with three-dimensional point clouds.

Delaunay traversal
This section presents the novel pairwise Euclidean distance nearest neighbor matching algorithm, which utilizes the Delaunay triangulation of the target 3D point set. Once the Delaunay triangulation for a reference point set has been generated, the nearest neighbor of a query point is found by traversing the edges of the Delaunay graph. The algorithm utilizes a greedy approach where at each node of the Delaunay graph, if a connected node is closer than the current visiting node to the query point, the algorithm proceeds to check the nodes connected to the new current visiting mode. This process continues until there are no closer nodes to the query point connected to the current visiting node.

Notation
Given a point set X = {x 1 , . . . , x n } of n points of d dimensions, the nearest neighbor of a point p ∈ P is Squared distance is utilized to avoid a square root operation. Because multiple points within X may be equidistant to p, the result of the nearest neighbor function might not be unique. A Delaunay triangulation of a point set is a graph G = (1..k, E) consisting of edges from each point x i to the Delaunay neighbors x j =i . To find the nearest neighbor of a source point within the target point set utilizing the Delaunay triangulation, the algorithm can start at an arbitrary start point, To store the full Delaunay triangulation, for each point x i in the data set there is a corresponding list of values: In Eq. 4,ê i, j is the normalized Delaunay edge from point x i to one of its Delaunay neighbors x j . In Eq. 5, m i, j is one half the distance, or the midpoint, between x i and x j . Lastly, in Eq. 6, j is the reference index where point x j is stored in the data set.

Delaunay creation
The Delaunay triangulation is pre-calculated offline when performing a pairwise Euclidean distance nearest neighbor matching with a known 3D point set. This step is done utilizing the Open3D library [41] discussed in Sect. 2.3. In order to prepare the Delaunay graph for the nearest neighbor traversal, the edges of each tetrahedron are extracted and grouped according to connections when visiting a point. Thus, each edge is represented twice. Although this factor increases required memory space, the graph becomes bidirectional and allows for an arbitrary start-point.

Traversal
Given an arbitrary query point p, we may find p's nearest neighbor using a series of traversals, or walks, along the Delaunay graph. This sequence of walks is taken from a point x 0 to point x, where x satisfies Eq. 1. Figure 5 depicts an example of what the algorithm calculates at each node of the walk. In Fig. 5, x j = x is the nearest neighbor of query point p. The first step in the algorithm is calculating where u is the vector from x i to p. Next, the dot product between u andê i, j is determined. In Fig. 5, unit vectors from x i to x j , x k , and x l , each constitute a specificê i, j . Sinceê i, j is a unit vector, the dot product c is the scalar component of u in the direction ofê i, j . The dashed lines in Fig. 5 depict where the projections intersect each Delaunay edge. The scalar c can then be compared to the midpoint m i, j (green points in Fig. 5) from Eq. 5. If the scalar c is greater than m i, j , p is closer to x j than x i . Since the algorithm is seeking the point closest to p, the point x j corresponding to the largest c is the next node in the graph to visit. Thus, x i is replaced by point x j , and the algorithm continues. However, if no c is greater than m i, j , the algorithm stops and returns the current point x i as the nearest neighbor. The pseudocode for the Delaunay traversal is shown in Algorithm 1. Section 4 covers selecting an efficient starting point, x 0 , to reduce the number of walks.

Space and time complexity
Seidel provides a proof illustrating O(n d/2 ) as the upperbound for the number of faces within a Delaunay triangulation. With d = 3, a 3-dimensional object will have on the order n 3/2 faces. As such, each point will be a member of  vertices have been rejected, meaning the entire dataset has been visited. Thus, in the worst case, the algorithm will need to visit √ n vertices. However, because with each step of the traversal, the algorithm advances toward the neighbor and away from farther vertices, the worst case will rarely occur. In fact, we present heuristics in Sect. 4 to show how to greatly reduce the number of traversals required.
In contrast, a k-d Tree has log 2 (n) levels and an average query time proportional to O(log(n)). However, since the search may involve the entire tree, the worst-case becomes O(n). Similarly, the average and worst-case for an octree are O(log(n)) and O(log(n)), respectively. Additionally, implementing a search of a tree structure typically utilizes recursion. While recursion may execute efficiently on a central processing unit (CPU), a graphics processing unit (GPU) will typically have a smaller stack size. For instance, the depth limit for Compute Unified Device Architecture (CUDA) kernels is 24 [33]. Thus, the search algorithm needs to be implemented iteratively rather than recursively. With respect to memory layout, since the Delaunay traversal structure is vectorized, it lends toward keeping cache coherency. Keeping the data structure in cache facilitates faster searches with less time copying data from main memory.

Degenerate cases
A few cases exist where the Delaunay triangulation produces suboptimal traversal results. 2D point sets result in edges isolated only to close neighbors. Similar to issues encountered   [16], which utilizes Voronoi cells to determine neighbors, a 2D circle results in each point being connected to only the two closest points. In Fig. 6, each vertex connects only to the two closest vertices. The edges highlighted in green are the edges connected to the pink vertex, which in turn connects to the blue vertices. Additionally, the Delaunay triangulation of a line or plane only connects neighbor points. Looking at Fig. 7, the highlighted edges in green connect the pink point only to the four closest points in blue. The main issue arising from this locality is the number of walks required to find the neighbor increases if the algorithm only can walk to points physically close.
To effectively utilize the traversal algorithm on 1D or 2D point sets, virtual points can be introduced during the Delaunay creation. These virtual points exist in a higher dimension, and ideally at distances relatively far from the original points. This virtual bounding box introduces new points and facilitates in the Delaunay generation. After creation, the virtual points can be removed.

Delaunay walk variations
In Sect. 3, we covered how the Delaunay traversal algorithm identifies the nearest neighbor to a query point. Because the Delaunay triangulation forms a connected graph, the traversal algorithm can begin at an arbitrary point and be guaranteed to find the nearest point. Consequently, a heuristic or hysteresis can accelerate the correspondence search with the knowledge that there will be a path from the starting node to the nearest neighbor.
Leveraging the iterative aspect of ICP, the nearest neighbor search utilizing the Delaunay Walk can be further accelerated. With each iteration of ICP, the source cloud is incrementally transformed to align with the target. After the first few iterations, individual points move relatively small distances from their previous correspondence. Thus, if the correspondences of the previous iteration are cached, the nearest neighbor search can begin at a likely already close point and require fewer traversals to determine the new correspondence. To demonstrate, our research motivated variations focusing on the starting node of the traversal.
The Zero Delaunay walk serves as a benchmark, starting from an arbitrary mode for all traversals.

Zero Delaunay walk
In the Zero Delaunay walk variation, the traversal begins from a predetermined node each iteration. The walk originates with no search for a starting node. Each search can begin at the node closest to the center of mass, the most traveled node, or an arbitrary node. This variation requires the least memory and offline preparation; however, it requires the most number of traversals compared to the other variations as seen in Figs. 8 and 9. This variant has the largest mean and variance in the number of walks.

k-d approximate Delaunay walk
In the KD-ANN Delaunay walk variation, the search begins with an approximate nearest neighbor (ANN). The ANN is found by conducting a depth-first search of a k-d tree and returning the leaf node without back-tracing. The ANN returned is usually a close neighbor, but it can return the exact nearest neighbor. Figures 8 and 9 show this effect. By starting the walk from a close neighbor, the number of nodes in the walk decrease substantially. Since this algorithm monotonically converges, a close starting point can significantly

Previous nearest neighbor Delaunay walk
In the PNN Delaunay walk variation, the walk starts from the node found in the previous iteration of ICP. Since aligning the point clouds occurs iteratively, the previous nearest neighbor generally resides close to the current nearest neighbor. No additional computation is required; therefore, the search leverages the previous ICP iteration's work to make the current iteration more efficient. Additionally, increasing the size of the dataset has little effect on the number of walks required. When doubling the points in the Teapot model in Fig. 8a and b, the mean number of walks only increases from 1.39 to 1.56. However, the variances nearly double, indicating there are proportionally more walks of longer length in the larger dataset. Figure 9a and b depicts a result similar to the Teapot data set. The mean number of walks for the Dragon only increases from 1.87 to 2.12, and these values are not dissimilar to the Teapot. While the variances again nearly double, these values are still significantly less than the first two variants.

PNN optimized Delaunay walk
In the PNN Optimized Delaunay walk variation, the KD-ANN Delaunay walk is used for the first iteration and the PNN Delaunay walk is used in all subsequent iterations. By combining aspects of these variations, this walk variation is the most efficient. Looking at Figs. 8 and 9, this algorithm presents the smallest average number of walks as well as the smallest variance of all four walk variants.
In addition to requiring fewer walks, the two variants utilizing previous neighbor results gain an advantage from increased cache hits. Since the preceding iteration utilized the previous neighbor, it will likely remain in cache memory, (a) (b) Fig. 9 Frequency of walks taken for 31k (a) 62k (b) Dragon 3D point set facilitating quick access. Consequently, the Delaunay edges of the previous neighbor likely persist in the cache, allowing the walk to initiate with no access to main memory.

Experiments
To demonstrate the speed and accuracy of the Delaunay Traversal, we conducted various experiments on several 3D point data sets: Teapot, Aircraft A, Aircraft B, Sports Car, and Dragon, all shown in Fig. 10. With these data sets, the runtimes of nearest neighbor methods presented in this paper were tested against the two traditional methods of Brute Force and k-d Tree. These tests involved matching the point sets to themselves with rotational offsets as well as adding artificial noise. In the second set of experiments, we integrated our ICP with the accelerated Delaunay Traversal into a stereo vision pipeline. The vision pipeline processed images from virtual and real cameras with truth data collected a motion capture system. All algorithms were implemented in both C++ and CUDA to demonstrate execution on a serial and vector processor. The CPU utilized for these experiments is the AMD Ryzen Threadripper 3970X 32-Core processor [1] at 3.9 GHz with 64 GB of RAM. The GPU is the NVidia GeForce RTX 3080 [2]. Please see [37] for the full CUDA [33] GPU ICP implementation used in this paper.

Clean and noisy 3D point sets
In these experiments, the 3D point sets were registered to themselves with ICP utilizing each nearest neighbor matching algorithm. The 3D point sets were evaluated at a base fidelity and downsampled by a factor of 2. The nearest neighbor algorithms compared are: Brute Force, k-d Tree, Zero Delaunay, KD-ANN Delaunay, PNN Delaunay, and PNN Optimized Delaunay. ICP was executed with a maximum iteration count of 100 and an error of 1E −11. The target point set was kept at identity, and the source points iterated through a series of rotations ranging from −20 to 20 degrees in 10 degree increments in roll, pitch, and yaw. This deterministically guarantees each algorithm receives the same registration inputs. Lastly, noise is added to the target points, as seen in Fig. 15a.
Because identical point clouds are registered to themselves, ICP returns the exact rotation and translation with little to no error. Additionally, as long as each nearest neighbor method provides accurate results, ICP will perform with the same accuracy. Each run of ICP in these experiments did return little to no error, demonstrating the accuracy and validity of each nearest neighbor method. Figure 11a shows the runtime required for converging the source 3D point set onto the target 3D point set on the CPU. The PNN Delaunay nearest neighbor algorithm executed the fastest the CPU. Figure 11b shows the runtime required for converging on the GPU. The PNN Optimized Delaunay converged most quickly on the GPU. The experiments show the most efficient algorithms are the PNN Delaunay on the CPU and PNN Optimized Delaunay on the GPU. For this reason, these algorithms are compared against each other to compare performance between a CPU and GPU. Figure 12a shows a runtime comparison of the CPU PNN Delaunay algorithm and the GPU PNN Optimized Delaunay nearest neighbor algorithms. The GPU runtime is about an order of magnitude faster than the CPU. Figure 13a shows the GPU speedup over the CPU for these algorithms. The GPU achieves just under a 140× speedup on the Dragon(62k) 3D point set. Figure 12b shows the overall total ICP runtime and Fig. 13b shows the speedup. The GPU achieves about a whole order of magnitude in runtime and 25× speedup when comparing the most efficient algorithms. Figure 14a shows the teapot(8k) 3D point set runtime per ICP iteration by the CPU PNN Delaunay algorithm. Figure 14b shows the same for the GPU PNN Optimized Delaunay algorithm. These are evidence the PNN Delaunay variants leverage the prior iteration's nearest neighbor. These algorithms increase in efficiency as iterations deepen. The negative slope in the fitted lines shows the increased efficiency.
When executing nearest neighbor in ICP, Figs. 8a, b and 9a, b show the walk frequencies of the Delaunay Walk variations on the teapot and dragon 3D point set fidelities. The PNN Delaunay and PNN Optimized Delaunay have the lowest average walks. PNN Optimized Delaunay has a lower average walk distance and variance than PNN Delaunay because the first iteration executes KD-ANN Delaunay to find a preferable starting location. Walk variance is important when executing on the GPU. If a single thread has a long walk distance, the whole thread block and possibly kernel will wait for the slowest thread to finish executing. This could also explain why the PNN Optimized Delaunay performs best on the GPU. Figure 15b shows the ICP total runtime with adding noise, using virtual, and using real images. A dirty target point cloud does not negatively affect the runtime. This proves the robustness of the ICP and nearest neighbor algorithms.

Virtual and real stereo block matching
In order to demonstrate the accuracy and correctness of the Delaunay traversal, we integrated the Delaunay nearest neighbor method as step 1 of ICP. From the known geometry of our reference model, a 1/7th scale model aircraft, the Delaunay triangulation was generated. Then, a set of sensed points from stereo vision image processing were registered onto the reference 3D point set. With the truth position of the model aircraft from a motion capture system, we verified our Delaunay traversal by assessing the accuracy of the pose estimation.
In this experiment, stereo imagery was collected from simulating a leading aircraft with rear-facing stereo cameras mounted on its belly. The objective is to compute the relative pose between the leading and a trailing aircraft using only the stereo cameras. In doing so, the vision processing enables autonomous station keeping between two aircraft. This functionality has many applications, including formation flight and automated aerial refueling, among others. The stereo camera produces a 3D point set generated via stereo block matching [20]. In terms of the experimental setup, the stereo cameras' resolution was 4096 × 3000 and field of view was 28.7 degrees. The baseline of the stereo system was 0.5 meters. A scale model of Aircraft B was used in an approach starting at a distance of approximately 20 meters to 14 meters from the cameras. The approach was captured with truth data recorded from a Motion Capture system [3]. Thus, the accuracy of ICP is able to be tested against the truth from the motion capture system (< 1 cm, < 0.1 deg). Lastly, to mimic a real-time application, ICP is executed with a maximum iteration count of 30 and RMS delta-error between iterations of 1E − 6. The video located at [5] shows a visualization of the motion capture environment.
In order to validate the correctness of the motion capture and stereo vision systems, truth data from the approach were replayed in a virtual simulation as a digital twin to the real images. Utilizing virtual images provides a baseline of accuracy for the motion capture and stereo vision systems. Assuming virtual images will produce less noise than real images, we utilize the virtual simulation to test and prototype various algorithms. Confidence in the system increases if processing real and virtual images produce similar results.
Virtual cameras with parameters and pose set to correspond with their real counterparts captured the 3D environment. Figure 16 shows the virtual images in the lower left and right and the real images in the upper left and right, the sensed points in yellow, and the reference points in red. In the middle left of Fig. 16 the red tubes depict the location and view direction of the stereo cameras in the real environment and virtual environment-the real and virtual stereo cameras are co-incident and share the same coordinate frame with respect to the truth motion capture system.
As seen in Fig. 17a and b, when registering points generated from virtual images, the CPU and GPU implementations perform with the same accuracy. Furthermore, when the virtual images are replaced with real images, and the points generated from stereo block matching contain more noise,   Fig. 18a and b. Tables 1 and 2 show the positional and rotational mean magnitude error of ICP when using real images. The CPU and GPU have a very close error, the minor difference mostly due to hardware differences causing rounding differences. The CPU reports a slightly more accurate position, whereas the GPU reports a slightly more accurate rotation. Overall, there is less than a magnitude of 1.7 centimeters in positional error and a magnitude of 1.3 degrees of rotational error.

Conclusion
In this study, highly efficient and novel nearest neighbor matching algorithms were introduced and implemented based on a Delaunay triangulation. In comparison with the traditional k-d tree, the Delaunay traversal is shown to provide around 90% speedup. When integrated into ICP, the previous correspondences can cached to facilitate fewer traversals, and our research shows increasing the size of the 3D point-set has little effect on the time to find a correspondence. Finally, the accuracy of the Delaunay traversal was validated by experiments involving point registration with point-sets from real and virtual point sets.

Future work
Future work is planned to execute more real-time experiments with different 3D point sets. Additionally, research involving updating the reference point set to reflect changes in the viewpoint of the sensors is ongoing. If the Delaunay triangulation needs to be recomputed, parallel processes such as those presented in [24] may be utilized. Finally, it is planned to integrate the nearest neighbor algorithms into other applications, like point-to-plane ICP.