6DOF pose estimation of a 3D rigid object based on edge-enhanced point pair features

The point pair feature (PPF) is widely used for 6D pose estimation. In this paper, we propose an efficient 6D pose estimation method based on the PPF framework. We introduce a well-targeted down-sampling strategy that focuses on edge areas for efficient feature extraction for complex geometry. A pose hypothesis validation approach is proposed to resolve ambiguity due to symmetry by calculating the edge matching degree. We perform evaluations on two challenging datasets and one real-world collected dataset, demonstrating the superiority of our method for pose estimation for geometrically complex, occluded, symmetrical objects. We further validate our method by applying it to simulated punctures.


Introduction
The goal of 6D pose estimation is to detect the position and orientation of a target object to obtain a rigid transformation from the object coordinate system to the camera coordinate system. Pose estimation has been considered an important part of target recognition and scene understanding. Pose estimation has also been widely used in industrial and medical fields. In the medical field, with the continuous development of medical imaging, computer-assisted surgery technology, and 3D vision technology, 3D vision-based navigating robot-operated surgery has become a trend [1,2]. In 3D vision-based navigating robot-operated surgery, the registration of preoperative 3D models reconstructed by medical imaging and intraoperative spine point clouds acquired by depth cameras is crucial.
In real surgical scenarios, the human spine has a complex geometry and exhibits high occlusion and symmetry [3], which can lead to algorithmic miscalculations. There is no satisfactory and universal solution for this problem. In this work, we propose a pose estimation method for the special geometry of the spine. For the complex shape of the spine, we found that more spine feature points exist on the edges. Therefore, an edge-focused sampling method is used to select stable and salient points to generate stable transformation hypotheses. For the ambiguity caused by spinal symmetry, we consider that the difference in details between symmetric and highly occluded objects can be effectively distinguished by the degree of edge matching.
Overall, the contributions of our work are summarized as follows.
• A well-targeted down-sampling strategy combines edge information. It effectively retains edge points and points with large curvature variations. Robust hypothesis generation is achieved by sampling stable feature points.
• A pose hypothesis verification method considers the degree of matching with edge points. It has an early exit strategy to reduce time costs.
• An experimental platform of robot-operated positioning based on this method is implemented. We use the position-based visual servoing scheme to control the robot arm to reduce the deviation of the drilling position.

Related works
This section reviews related algorithms for pose estimation in 3D point clouds, point pair features, and their modifications. Algorithms based on global features [4][5][6] perform well in terms of computation time and memory consumption. However, they are limited in clinical applications due to their sensitivity to occlusion and noise, and the need to pre-isolate the region of interest from the background. Algorithms based on local features [7][8][9][10] are more robust to occlusion and clutter. Nevertheless, they incur additional computation time during the subsequent matching and hypothesis validation, so they do not meet the requirements of a real-time surgical navigation system. The method based on template matching [11] can detect texture-free targets but is sensitive to surgical instrument occlusion. The main representatives of the point-based methods are the Iterative Closest Point (ICP) algorithm [12] and its variants [13,14].

Pose estimation methods
The ICP algorithm and its variants are dependent on the initial pose and are usually used in pose refinement. Deep learning-based methods [15][16][17][18][19] perform well on public 3D datasets. However, deep learning-based methods require significant computational power and time to label datasets. The difficulty of collecting medical samples and the small amount of data hinder the application of deep learning-based methods for surgical navigation.

Point pair feature
In 2010, Drost et al. [20] proposed a rigid 6D pose estimation method based on the point pair feature (PPF), which is a compromise between local feature and global feature methods, striking a good balance between accuracy and speed. PPF describes the surface of an object through global modeling of four-dimensional features defined by oriented point pairs. These features are used to find the correspondences between scene and model point pairs, generate numerous candidate hypotheses, and then cluster and rank the candidate poses to obtain the final hypotheses. PPF features are low-dimensional features of the oriented points and are suitable for objects with rich surface variation. Moreover, the PPF descriptors with global significance have stronger discriminative power than most local features. The PPF is therefore suitable for the complex structures and occluded objects studied in this paper, so we choose the PPF framework as the backbone.
Because of the advantages of PPF, many improvement schemes based on PPF have been proposed. Choi et al. [21] proposed a color point pair feature (CPPF), which uses color information to significantly improve the discrimination and accuracy of traditional point pair features. Drost et al. [22] proposed the concept of geometric and textured edges. Geometric edges are obtained using the intensity image and depth image to construct multimodal point pair features. Liu et al. [23] proposed a novel descriptor named Boundary-to-Boundary-using-Tangent-Line (B2B-TL) to estimate the pose of industrial parts. Vock et al. [24] utilized point pair features that lie on edges for the quick generation of transformation guesses in a Random Sample Consensus setting. Inspired by the above works, we propose a down-sampling method for the spine that combines edge points and geometric high-curvature feature points. A pose hypothesis verification method based on edge matching is proposed to make the approach more competitive in detecting geometrically complex and symmetrical objects such as the spine.
The rest of this paper is organized as follows. Section 3 describes the original PPF method, and Section 4 describes our proposed method and the design of robot-operated positioning experiments. Experimental results for the spine dataset and the public datasets are given in Section 5. Section 6 concludes the paper.

PPF Method
Our approach is based on the original PPF method [20]. To make this article easier to follow, we introduce the basic framework of this method in this section.

Point pair feature
The point pair feature is used to describe the relative distance and normals of a pair of oriented points, as shown in Fig. 2. Given a reference point $p_r$ and a second point $p_s$ with normals $n_r$ and $n_s$ respectively, the PPF is a four-dimensional vector which is defined as:

$$F(p_r, p_s) = \left(\lVert d \rVert_2,\ \angle(n_r, d),\ \angle(n_s, d),\ \angle(n_r, n_s)\right) \qquad (1)$$

where $d = p_r - p_s$ and $\angle(a, b)$ is the angle between the vector $a$ and the vector $b$.
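To make the definition concrete, the following is a minimal NumPy sketch of the feature computation. The four components follow the common formulation of the PPF in [20], with $d = p_r - p_s$ as stated above; the function names `angle_between` and `point_pair_feature` are illustrative and not taken from the paper's implementation. The two points are assumed to be distinct so that $d$ is non-zero.

```python
import numpy as np

def angle_between(a, b):
    """Angle in radians between vectors a and b, in [0, pi]."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def point_pair_feature(p_r, n_r, p_s, n_s):
    """Return (||d||, angle(n_r, d), angle(n_s, d), angle(n_r, n_s)), d = p_r - p_s."""
    d = p_r - p_s  # assumed non-zero (distinct points)
    return (float(np.linalg.norm(d)),
            angle_between(n_r, d),
            angle_between(n_s, d),
            angle_between(n_r, n_s))

# Toy oriented points (hypothetical values).
p_r = np.array([0.0, 0.0, 0.0]); n_r = np.array([0.0, 0.0, 1.0])
p_s = np.array([1.0, 0.0, 0.0]); n_s = np.array([1.0, 0.0, 0.0])
feature = point_pair_feature(p_r, n_r, p_s, n_s)
```

For this toy pair the distance is 1 and the two normals are perpendicular, so the last component is a right angle.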

Drost's pipeline
The PPF method can be divided into offline global modeling and online local matching.
In the offline global modeling phase, to create a description of the model, the model is down-sampled using uniform sampling. Then the point pair features are computed and quantized for all permutations of model point pairs. The quantized point pair features are stored as hash keys in a hash table, and the value encodes the pose of the feature relative to the model. The pose of the model is encoded by storing the index of the reference point $p_r$ and an angle $\alpha_m$, the latter of which represents the angle between the projection of the line connecting the model point pair and the positive direction of the Y-axis.
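The offline table construction can be sketched as follows. This is a minimal illustration rather than the implementation of [20]: the names `quantize` and `build_model_table`, the tuple layout of the input, and the step sizes used in the example are assumptions.

```python
import math
from collections import defaultdict

def quantize(feature, dist_step, angle_step):
    """Map a 4D PPF (distance, three angles in radians) to an integer hash key."""
    dist, a1, a2, a3 = feature
    return (int(dist // dist_step),
            int(a1 // angle_step),
            int(a2 // angle_step),
            int(a3 // angle_step))

def build_model_table(entries, dist_step, angle_step):
    """entries: iterable of (reference_index, alpha_m, ppf).

    Returns a dict: quantized PPF -> list of (reference_index, alpha_m).
    """
    table = defaultdict(list)
    for ref_index, alpha_m, ppf in entries:
        table[quantize(ppf, dist_step, angle_step)].append((ref_index, alpha_m))
    return table

# Hypothetical model pairs: the first two fall into the same bin.
entries = [
    (0, 0.1, (0.051, 0.50, 1.00, 1.50)),
    (3, 0.7, (0.055, 0.51, 1.01, 1.51)),
    (1, 0.3, (0.090, 0.20, 0.20, 0.20)),
]
table = build_model_table(entries, dist_step=0.02, angle_step=math.radians(5))
```

Pairs with the same quantized feature share a bucket, which is what lets a scene pair retrieve all geometrically similar model pairs in one lookup.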
The online local matching phase consists of two parts: (1) find the correspondence between point pairs using four-dimensional point pair features; (2) generate hypothetical poses from the correspondences and then cluster them to obtain the best object pose. In the first part, the reference points are sampled from the scene. Uniform down-sampling of the scene point cloud is performed to obtain a set of scene points, and then every i-th (default i = 5) scene point is used as a reference point. The PPF is computed between this reference point and all other scene points, and each feature is mapped to a model reference point and angle $\alpha_m$ by looking it up in the previously constructed hash table. This process effectively solves the correspondence problem between point pairs by matching point pairs with the same quantized PPF. In the second part, the angle $\alpha_s$ of each scene point pair is calculated; $\alpha_s$ represents the angle between the projection of the line connecting the scene point pair and the positive direction of the Y-axis. For each matched point pair feature, the angle $\alpha = \alpha_m - \alpha_s$ is computed, and then voting is performed in the Hough space of $(p_r, \alpha)$. The maximum of the votes in the Hough space is extracted to form a pose hypothesis. After the valid candidates are generated for all reference points, similar poses whose rotation and translation differences do not exceed the thresholds are clustered into groups. The group with the highest cumulative number of votes gives the resulting pose hypothesis.
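The voting step for a single scene reference point can be sketched as a 2D accumulator over model reference index and quantized $\alpha$. This is an illustrative sketch only: the function name `vote`, the input layout, and the choice of 30 angle bins are assumptions, not values from the paper.

```python
import math
import numpy as np

def vote(matches, alpha_s_list, n_model_points, n_angle_bins=30):
    """Accumulate votes for one scene reference point.

    matches[k]      : list of (model_ref_index, alpha_m) retrieved from the
                      hash table for the k-th scene point pair.
    alpha_s_list[k] : angle alpha_s of that scene point pair (radians).
    Returns (best_model_ref_index, best_angle_bin, vote_count).
    """
    acc = np.zeros((n_model_points, n_angle_bins), dtype=int)
    for hits, alpha_s in zip(matches, alpha_s_list):
        for ref_index, alpha_m in hits:
            alpha = (alpha_m - alpha_s) % (2.0 * math.pi)  # wrap to [0, 2*pi)
            bin_idx = int(alpha / (2.0 * math.pi) * n_angle_bins) % n_angle_bins
            acc[ref_index, bin_idx] += 1
    ref_index, bin_idx = np.unravel_index(np.argmax(acc), acc.shape)
    return int(ref_index), int(bin_idx), int(acc[ref_index, bin_idx])

# Two scene pairs agree on model point 2 with alpha = 0.5 rad; one disagrees.
matches = [[(2, 1.0)], [(2, 1.5)], [(0, 0.3)]]
alpha_s_list = [0.5, 1.0, 0.1]
best = vote(matches, alpha_s_list, n_model_points=4)
```

The peak of the accumulator identifies the model reference point and rotation angle that most scene pairs agree on, which together with the scene reference point fixes a full pose hypothesis.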

Overview of Our Approach
We propose a new 6D pose estimation algorithm, the specific framework of which is shown in Fig. 3. Based on PPF, we mainly make the following improvements. First, in the pre-processing of the input model, we filter out the point pair features that tend to interfere with the matching, based on the normal vector angle. Secondly, in the pre-processing of the scene point cloud, we use a clustered down-sampling method that preserves the edge point cloud. Finally, pose verification is performed by checking the matching degree of edges to filter out wrong poses. The proposed improvements are described in the following sections.

Offline training
In the offline training phase, all point pair features of the model are extracted and stored in a hash table to create a global model description. However, due to self-occlusion, the global description contains some redundant point pair features that never appear in the input scene. The redundant point pair features not only increase the search time in the online matching phase but also increase the matching error. To mitigate the negative impact of redundant point pair features, we adopt a method based on [25] to determine the visibility of point pair features by using the normal vector angle between point pairs. If the angle between the normal vectors of two oriented points is higher than 175°, we consider the point pair as almost invisible, and the point pair feature is not stored. On the other hand, it is common for the traditional PPF method to degrade when the object has many repetitive features, such as large planes. Therefore, we also do not store point pair features whose normal vector angle is less than 5°, so the algorithm focuses more on the geometrically rich point pair features. As shown in Fig. 4, we mainly filter out the points that are self-occluded with respect to the viewpoint and the points that are on the same plane.

Fig. 3 ... In the filtering step (d), we filter out PPF features with angles higher than 175° or lower than 5° by judging the normal vector angle between the point pairs. PPF features are extracted and stored in a hash table (e). In the online matching stage, the scene point cloud is input (f). In the pre-processing of the scene point cloud (g), we use a clustered down-sampling method that takes into account the normal vector information and focuses on the edge point cloud and the points with large curvature. The PPF features extracted from the scene point cloud (c) are matched to the hash table, and the candidate poses are generated by voting and pose clustering (h). Each candidate pose is then post-processed (i). The pose with the highest matching score is selected by an improved edge-based pose verification method. Finally, we use ICP to refine the final result pose.

Fig. 4 When p1 is used as the reference point, p2, which has a normal vector angle of more than 175° with p1, will not appear in the same view due to the visibility constraint of the viewpoint. Due to the specificity of the plane structure, points in the same plane such as p3 are easily mapped to the same hash bin in the hash table, which reduces the performance of the algorithm.
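The normal-angle filter described in this section reduces to a simple predicate on a point pair. The sketch below uses the 5° and 175° limits from the text; the function name `keep_pair` is illustrative.

```python
import numpy as np

def keep_pair(n_r, n_s, low_deg=5.0, high_deg=175.0):
    """Return True if the pair's normal angle lies within [low_deg, high_deg].

    Pairs above high_deg are treated as mutually invisible; pairs below
    low_deg are treated as repetitive (e.g. both points on one plane).
    """
    n_r = np.asarray(n_r, dtype=float)
    n_s = np.asarray(n_s, dtype=float)
    cos = np.dot(n_r, n_s) / (np.linalg.norm(n_r) * np.linalg.norm(n_s))
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return bool(low_deg <= angle <= high_deg)

opposite = keep_pair([0, 0, 1], [0, 0, -1])   # 180 degrees: dropped
coplanar = keep_pair([0, 0, 1], [0, 0, 1])    # 0 degrees: dropped
informative = keep_pair([0, 0, 1], [1, 0, 0]) # 90 degrees: kept
```

Only pairs that pass this predicate would be inserted into the model hash table.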

Pre-processing
In order to accelerate the computation of object poses, the scene point cloud must be down-sampled. Unlike Drost's method [20], we use a clustering down-sampling method that takes normals into account, similar to [26,27]. However, we also focus on the edge points of the point cloud. Edge points can robustly describe the shape of the object, and for complex objects such as spinal bones, feature points have a higher probability of lying on the edges. Our approach is shown in Fig. 5, where we first create a multi-resolution grid structure to discretize the scene point cloud according to the diameter of the model. Similar points whose normal angle difference is less than the threshold θ are then merged within a voxel of the grid. After the first fine-grained sampling, we extract the edge point clouds and continue with a fine-to-coarse multi-resolution sampling strategy for the non-edge points.
To prevent some geometric features from being filtered out in the coarse-grained grid, the threshold θ is gradually reduced proportionally. The above operation can effectively preserve the edge points and the points with large curvature.
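A single resolution level of such a sampler might look like the sketch below. It is a simplified illustration under several assumptions: edge points are supplied as a precomputed boolean mask (`is_edge`), since the edge detector is not specified here; points inside a voxel are greedily merged when their normal is within θ of a cluster's mean normal; and the function name is hypothetical. A fine-to-coarse schedule could be approximated by calling it repeatedly on the non-edge output with a larger voxel size and a smaller θ.

```python
import numpy as np

def normal_aware_downsample(points, normals, is_edge, voxel_size, theta_deg):
    """Keep every edge point; merge non-edge points per voxel by normal similarity."""
    kept = [p for p, e in zip(points, is_edge) if e]
    cos_thr = np.cos(np.radians(theta_deg))
    voxels = {}  # voxel index -> list of clusters
    for p, n, e in zip(points, normals, is_edge):
        if e:
            continue
        key = tuple(np.floor(p / voxel_size).astype(int))
        clusters = voxels.setdefault(key, [])
        for c in clusters:
            mean_n = c["n_sum"] / np.linalg.norm(c["n_sum"])
            if np.dot(mean_n, n) > cos_thr:  # normal within theta: merge
                c["p_sum"] += p
                c["n_sum"] += n
                c["count"] += 1
                break
        else:  # no similar cluster in this voxel: start a new one
            clusters.append({"p_sum": p.copy(), "n_sum": n.copy(), "count": 1})
    for clusters in voxels.values():
        for c in clusters:
            kept.append(c["p_sum"] / c["count"])  # cluster centroid
    return np.array(kept)

points = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2],
                   [0.3, 0.3, 0.3], [0.9, 0.9, 0.9]])
normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0],
                    [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
is_edge = [False, False, False, True]
sampled = normal_aware_downsample(points, normals, is_edge,
                                  voxel_size=1.0, theta_deg=15.0)
```

In the toy input, the two points with identical normals collapse to one centroid, the point with a perpendicular normal survives as its own cluster, and the edge point passes through unchanged.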

Feature extraction
For scene point clouds, we follow the solution proposed in [20], choosing 1/5th of points in the scene as reference points and other points as the second point in the point pair feature.
To improve the efficiency of the matching part, we use a KD-tree structure and adopt the intelligent sampling strategy of Hinterstoisser et al. [28], pairing each reference point only with other points that lie within the model diameter d.

Pose clustering
To merge similar candidates, we use a hierarchical clustering method [26]. If the rotation and translation differences between two candidate poses are less than the predefined thresholds, the two candidate poses are grouped. All poses within each cluster satisfy the same conditions based on the two thresholds of rotation and translation. Finally, the quaternion average of each cluster is used to calculate a new candidate pose, and the scores of the poses in the cluster are summed to give the score of the new candidate pose.
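The grouping and averaging can be sketched as follows. Note that this sketch substitutes a simple greedy clustering (each pose joins the first cluster whose highest-voted pose is within both thresholds) for the hierarchical method of [26], and approximates the quaternion average by a sign-aligned normalized sum, which is reasonable only for tightly grouped rotations. The pose layout `(quaternion, translation, votes)` and the threshold values are assumptions.

```python
import numpy as np

def rotation_angle(q1, q2):
    """Angle (radians) of the relative rotation between two unit quaternions."""
    d = abs(float(np.dot(q1, q2)))  # abs handles the q / -q ambiguity
    return 2.0 * np.arccos(min(1.0, d))

def cluster_poses(poses, rot_thr, trans_thr):
    """poses: list of (unit quaternion, translation, votes). Returns merged poses."""
    clusters = []
    for q, t, votes in sorted(poses, key=lambda p: -p[2]):
        for c in clusters:
            q0, t0, _ = c[0]
            if rotation_angle(q, q0) < rot_thr and np.linalg.norm(t - t0) < trans_thr:
                c.append((q, t, votes))
                break
        else:
            clusters.append([(q, t, votes)])
    merged = []
    for c in clusters:
        q0 = c[0][0]
        q_sum = sum(q if np.dot(q, q0) >= 0 else -q for q, _, _ in c)
        q_mean = q_sum / np.linalg.norm(q_sum)
        t_mean = np.mean([t for _, t, _ in c], axis=0)
        merged.append((q_mean, t_mean, sum(p[2] for p in c)))
    return merged

def quat_about_z(angle):
    """Unit quaternion (w, x, y, z) for a rotation about the z-axis."""
    return np.array([np.cos(angle / 2), 0.0, 0.0, np.sin(angle / 2)])

poses = [
    (quat_about_z(0.0), np.array([0.0, 0.0, 0.0]), 10),
    (quat_about_z(0.02), np.array([0.001, 0.0, 0.0]), 5),
    (quat_about_z(np.pi / 2), np.array([0.0, 0.0, 0.0]), 7),
]
merged = cluster_poses(poses, rot_thr=0.1, trans_thr=0.01)
```

The two nearly identical poses merge and pool their votes, while the pose rotated by 90° stays in its own cluster.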

Post-processing
The score of each pose is obtained by adding the votes of the candidates in the cluster. In the presence of sensor noise and background clutter, the score of the poses may not correctly represent the degree of matching. Therefore, we recommend that a more reliable score be calculated through an additional re-scoring process. We observed that in most cases [26][27][28][29], most of the computational time is spent on pose verification. So, to ensure the time efficiency of pose estimation, we propose an edge-based pose hypothesis verification method with an early exit strategy. Edges are distinctive features of an object and can strongly represent the shape characteristics and contours of the object. With the edge information of the point cloud, it is possible to select the correct pose from a set of candidate poses with high probability. In our pose hypothesis verification method, for the input candidate pose, the axis-aligned bounding box (AABB) of the model under the candidate pose is used as the ROI. The edge points within the ROI are clustered, and the distance between the edge clustering center and the center of the candidate pose is computed to remove remote and divergent edge points. The reason for filtering based on the distance to the centroid of the edge clusters is that cluttered edges that do not belong to the object are often discontinuous and distant. The final score for this candidate pose is given in Eq. 2 below. $N_{ROI}$ denotes the number of edge points in the ROI (red part and blue part in Fig. 6) after filtering out outliers (yellow part in Fig. 6). $N_{Matching}$ is the number of edge points close to the candidate pose (red part in Fig. 6), and the degree of edge matching $S$ is given by:

$$S = \frac{N_{Matching}}{N_{ROI}} \qquad (2)$$

The specific steps of the pose verification process are as follows:
• The input candidate poses are sorted according to the number of votes, and the maximum number of votes for the candidate poses is $V_{max}$. The candidate poses are divided into two categories according to $V_{max}$. The first category contains the candidate poses with more than $V_{max}/2$ votes, which are more likely to be the correct pose. The second category contains the candidates with fewer than $V_{max}/2$ votes. The number of candidates in this category is much larger than in the first category.
• In the first category of candidate poses, we use a KD-tree to quickly evaluate how well each pose matches the edges of the scene. Those edge points that are close to the model indicate support for the pose hypothesis, after which the N candidate poses with the highest scores (the value of N is given in Section 5.4) are selected for more detailed filtering using Eq. 2. The reason why we do not directly use the edge match over the whole scene is that the correctness of the match is greatly reduced when the scene is cluttered. If the pose score computed by Eq. 2 is higher than 0.7, the pose is directly considered correct and the subsequent computation is stopped. If the scores of the N poses are lower than 0.7 but higher than 0.6, the one with the highest score among the N poses is selected.
• If the scores of the N poses of the first category are lower than 0.6, the poses of the second category are processed in the same way as the poses of the first category above. If the N poses of the second category are also not higher than 0.6, the pose with the highest score among the 2N candidate poses is selected as the final pose.
After selecting the final pose, ICP [13] is used to further refine the pose to improve the accuracy of the match.
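The score and the two-stage selection with early exit can be sketched as below. Several parts are assumptions made for illustration: the score is taken as the ratio of matching edge points to ROI edge points, as described in the text; a brute-force nearest-distance computation stands in for the KD-tree; the distance threshold `dist_thr` is hypothetical; and the candidate lists passed to `verify_poses` are assumed to be the already selected top-N poses of each category, with at least one candidate overall.

```python
import numpy as np

def edge_matching_score(model_edge_pts, roi_edge_pts, dist_thr):
    """S = N_matching / N_roi: fraction of ROI edge points near the posed model edges."""
    if len(roi_edge_pts) == 0:
        return 0.0
    diffs = roi_edge_pts[:, None, :] - model_edge_pts[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
    return float((nearest < dist_thr).sum()) / len(roi_edge_pts)

def verify_poses(first, second, score_fn, accept=0.7, fallback=0.6):
    """Return (pose, score) following the two-category scheme with early exit."""
    all_scored = []
    for group in (first, second):
        group_scored = []
        for pose in group:
            s = score_fn(pose)
            if s > accept:            # early exit: accept immediately
                return pose, s
            group_scored.append((s, pose))
        all_scored.extend(group_scored)
        best = max(group_scored, key=lambda sp: sp[0], default=None)
        if best is not None and best[0] > fallback:
            return best[1], best[0]   # good enough within this category
    best = max(all_scored, key=lambda sp: sp[0])  # best of the 2N candidates
    return best[1], best[0]

# Score example: two of the four ROI edge points lie near the model edges.
model_edges = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
roi_edges = np.array([[0.0, 0.0, 0.01], [1.0, 0.0, 0.01],
                      [5.0, 5.0, 5.0], [0.5, 0.0, 0.0]])
s_example = edge_matching_score(model_edges, roi_edges, dist_thr=0.05)

# Selection example with precomputed toy scores; `calls` records evaluations.
calls = []
toy_scores = {"a": 0.65, "b": 0.50, "c": 0.90}
def toy_score(pose):
    calls.append(pose)
    return toy_scores[pose]
chosen, chosen_score = verify_poses(["a", "b"], ["c"], toy_score)
```

In the selection example, no first-category pose exceeds 0.7, but the best one exceeds 0.6, so it is returned and the second category is never scored.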

Hardware composition
The hardware composition of our experiment is shown in Fig. 7. The 3D camera used in the experiments is the Azure Kinect DK depth camera. The robotic arm is the AUBO collaborative robot with six joints for flexible operation, and it is used to perform fixed-point movements to complete operations on the spine. The medical drill is fixed at the end of the robotic arm, pointing at the spine; it can be equipped with various drill bits and adjusted to different speeds. We build the platform not only in the real environment but also in the simulation environment.

Transformation relationship analysis
In order to control the drill mounted on the robotic arm so that it drills in the attitude we specify, we perform a coordinate transformation. The transformation relationships involved are those between the model of the spine, the fixed drill, the depth camera, the end of the robotic arm, and the base of the robotic arm. The goal is to obtain the expected transformation between the drill and the base of the robotic arm.
First, based on the surgeon's preoperative design, we obtain the target drill pose and position in the spine model coordinate system in advance and denote it as $T^{s}_{t_{hope}}$. The hand-eye calibration process gives the matrix $T^{e}_{c}$ that converts the coordinate system of the camera to the coordinate system of the end of the robotic arm. The tool calibration process gives the matrix $T^{e}_{t}$ that converts the coordinate system of the fixed drill to the coordinate system of the end of the robotic arm. The transformation matrix from the spine model coordinate system to the camera coordinate system is obtained from the above pose estimation algorithm and is denoted as $T^{c}_{s}$. The end effector's pose in the robot base coordinate system can be retrieved through the robotic arm's controller, and the current pose is denoted as $T^{b}_{e_0}$. In the fixed drill coordinate system, the transformation from the fixed drill to the expected drill attitude is:

$$T^{t}_{t_{hope}} = \left(T^{e}_{t}\right)^{-1} T^{e}_{c}\, T^{c}_{s}\, T^{s}_{t_{hope}}$$

Finally, the expected transformation between the drill and the robot arm base is:

$$T^{b}_{t_{hope}} = T^{b}_{e_0}\, T^{e}_{t}\, T^{t}_{t_{hope}}$$
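The chain of transformations can be checked numerically with 4×4 homogeneous matrices. The sketch below assumes the convention that a matrix with superscript a and subscript b maps coordinates expressed in frame b into frame a; the composition order follows from that convention and the definitions above, and all numeric values are hypothetical placeholders rather than calibration results.

```python
import numpy as np

def make_transform(rotation_z_rad, translation):
    """4x4 homogeneous transform: rotation about z followed by a translation."""
    c, s = np.cos(rotation_z_rad), np.sin(rotation_z_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

# Hypothetical inputs (metres, radians).
T_b_e0 = make_transform(0.3, [0.4, 0.0, 0.5])        # end effector in base frame
T_e_c = make_transform(0.1, [0.0, 0.05, 0.02])       # camera -> end effector (hand-eye)
T_e_t = make_transform(0.0, [0.0, 0.0, 0.15])        # drill -> end effector (tool calib.)
T_c_s = make_transform(-0.2, [0.0, 0.1, 0.6])        # spine model -> camera (estimated pose)
T_s_thope = make_transform(0.5, [0.01, 0.02, 0.03])  # target drill pose in spine frame

# Target drill pose expressed in the current drill frame.
T_t_thope = np.linalg.inv(T_e_t) @ T_e_c @ T_c_s @ T_s_thope

# Target drill pose expressed in the robot base frame.
T_b_thope = T_b_e0 @ T_e_t @ T_t_thope
```

Because the tool calibration matrix and its inverse cancel, the base-frame target equals the direct chain through the end effector, the camera, and the spine model, which is a useful sanity check on a real setup.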

Position-based visual servoing scheme
Visual servoing uses visual information extracted from images or point clouds captured by one or more cameras to control the motion of a robot. Visual servoing is a closed-loop system in which vision analysis provides guidance for the robot and robot motion provides new input for the vision analysis. A closed-loop design can effectively improve the success rate and reduce the deviation. We use a position-based visual servoing scheme, as shown in Fig. 8. The input is the difference between the detected actual pose of the spine and the desired spine pose. The output is a velocity command for the robot, whose purpose is to make the robot move quickly to the target pose. After the instruction is completed, the camera continues to provide feedback on the robot state, forming a closed-loop control system. The closer the real pose is to the desired pose, the smaller the speed of the robot arm will be. When the difference is less than the threshold we set, the speed of the robot arm is 0, and the servo stops.
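The behavior described above (speed proportional to the remaining error, zero below a threshold) corresponds to a proportional control law. The following sketch is restricted to the translational part for brevity, and the gain, threshold, and time step are hypothetical values; it illustrates the loop structure rather than the controller used in the experiments.

```python
import numpy as np

def pbvs_velocity(current, target, gain=0.5, threshold=1e-3):
    """Proportional velocity command; zero once the error is below the threshold."""
    error = target - current
    if np.linalg.norm(error) < threshold:
        return np.zeros_like(error)
    return gain * error

current = np.array([0.0, 0.0, 0.0])     # current tool position (m), hypothetical
target = np.array([0.1, -0.05, 0.2])    # desired tool position (m), hypothetical
dt = 0.1                                # control period (s)

steps = 0
for _ in range(1000):                   # guard against non-termination
    v = pbvs_velocity(current, target)
    if not v.any():                     # zero command: servo stops
        break
    current = current + v * dt          # simulated robot motion
    steps += 1

final_error = float(np.linalg.norm(target - current))
```

With these values the error shrinks by a constant factor per iteration, so the commanded speed decreases smoothly as the tool approaches the target.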

Experiments
In this section, after describing the datasets required for the experiments, the evaluation criteria, and the state-of-the-art open-source comparison methods, we first evaluate the impact of different parameters on the real spine dataset. Then, in Sections 5.5 and 5.6, a real spine dataset and a publicly available dataset are tested together to investigate the robustness of the algorithm and to validate the algorithm design. In Section 5.7, we evaluate our method quantitatively and qualitatively on the real spine dataset and show the result of the robot-operated positioning experiment. Finally, to demonstrate the effectiveness of our pose estimation method for objects with symmetry and complexity and its generality for objects of different shapes, we perform a comprehensive comparison of recognition rates and efficiency with state-of-the-art methods on two well-known publicly available datasets in Section 5.8.
The algorithm proposed in this paper is implemented in the Point Cloud Library (PCL) and tested on a PC with a 3.6 GHz Intel(R) Core(TM) i9-10850K CPU and 16GB of RAM, and the algorithm uses OpenMP technology to improve the matching speed.

The public datasets
The public datasets comprise the UWA dataset [30] and the DTU dataset [31]. The UWA dataset contains 5 complete 3D models as well as 50 2.5D scenes, where the rhino model is mainly used for interference. Each 2.5D scene contains four to five models, and the degree of model occlusion ranges from 65% to 95%. The five models and some scenes are shown in Fig. 9(a). The DTU dataset is a large dataset consisting of 45 objects and 3,204 scenes captured by a structured light scanner, each of which contains 10 objects. These objects belong to three different types: geometrically complex, cylindrical, and planar models. Because some objects are highly occluded, we do not consider objects with more than 98% occlusion. The DTU dataset is challenging because of the high occlusion, high similarity, and diversity of models. Some of the models and scenes are shown in Fig. 9(b).

Spine dataset
To validate the effectiveness of our algorithm for spinal bone pose estimation, we construct a real dataset of the pig spine. The spine model point cloud is accurately reconstructed from a CT scan of the spine, and we then use the professional medical software Mimics Research to convert the medical data in DICOM format into 3D models. The experimental platform built in

Evaluation criteria
To determine the pose accuracy, we adopt the Average Distance Metric (ADM) [32] as the pose error metric. It considers both the visible and invisible parts of the 3D model surface. ADM measures the mean Euclidean distance between the model points transformed by an estimated pose $\hat{T}$ and by the true pose $T$, respectively. In [27], two variants of ADM (ADD and ADI) are used for objects that do not have symmetric properties and for those that do, respectively. We also use this evaluation criterion, and we accept a pose estimate as correct if the pose error is less than $\zeta_e$, where $\zeta_e$ is related to the object diameter d. The pose error metrics of ADD and ADI are given by: where $M$ is the point cloud of the model and $c_o$ is the object center. $e_{ADD}$ is the average Euclidean distance between corresponding points after the two transformations, while $e_{ADI}$ is the average Euclidean distance between closest points after the two transformations and also takes into account the distance of the object center.
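The difference between the two error measures can be illustrated with a short NumPy sketch. It implements the commonly used forms of ADD (same-point distance) and ADI (closest-point distance); the object-center term that the variant in this paper adds to $e_{ADI}$ is not reproduced here, so the `adi_error` function should be read as an assumption-laden approximation. Function names and the toy model are illustrative.

```python
import numpy as np

def transform(T, pts):
    """Apply a 4x4 homogeneous transform to an (N, 3) array of points."""
    return pts @ T[:3, :3].T + T[:3, 3]

def add_error(T_est, T_gt, model):
    """Mean distance between the same model points under both poses."""
    diff = transform(T_est, model) - transform(T_gt, model)
    return float(np.linalg.norm(diff, axis=1).mean())

def adi_error(T_est, T_gt, model):
    """Mean distance from each estimated point to its closest ground-truth point."""
    est = transform(T_est, model)
    gt = transform(T_gt, model)
    dists = np.linalg.norm(est[:, None, :] - gt[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())

# Toy model with a two-fold symmetry about the z-axis.
model = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]])
T_gt = np.eye(4)
T_est = np.eye(4)
T_est[:3, :3] = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]  # 180 deg about z

e_add = add_error(T_est, T_gt, model)
e_adi = adi_error(T_est, T_gt, model)
```

For the symmetric toy model, a 180° rotation maps the point set onto itself: the same-point distance is large even though the two poses are visually indistinguishable, whereas the closest-point distance is zero. This is why the closest-point variant is used for symmetric objects.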
In this paper, we use two evaluation criteria, Recognition Rate (RR) and Mean Recall (MR), to evaluate the performance of the algorithm. RR is the ratio of correct poses to all detected poses. MR is the average recognition rate over all objects and is used to measure the detection quality of the algorithm on the entire dataset:

$$MR = \frac{1}{|O|} \sum_{o \in O} \frac{\sum_{s \in S} |P(o, s)|}{\sum_{s \in S} |G(o, s)|}$$

where $O$ and $S$ are the sets of all template objects and scenes, respectively, $P(o, s)$ is the set of correctly detected poses, and $G(o, s)$ is the set of ground-truth poses of object $o$ in scene $s$.
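As an illustration of how MR could be computed from per-object, per-scene counts, consider the sketch below. It assumes that the recall of each object is the number of its correctly detected poses divided by the number of its ground-truth poses, summed over scenes, and that MR is the unweighted mean over objects; this reading of the definitions above, the dictionary layout, and the object and scene names are assumptions.

```python
def mean_recall(correct, ground_truth):
    """correct[o][s] = |P(o, s)|, ground_truth[o][s] = |G(o, s)| (counts).

    Returns the unweighted mean over objects of per-object recall.
    """
    recalls = []
    for obj, gt_per_scene in ground_truth.items():
        n_gt = sum(gt_per_scene.values())
        n_ok = sum(correct.get(obj, {}).get(scene, 0) for scene in gt_per_scene)
        recalls.append(n_ok / n_gt)
    return sum(recalls) / len(recalls)

# Hypothetical counts for two objects over three scenes.
ground_truth = {
    "obj_a": {"s1": 1, "s2": 1},
    "obj_b": {"s1": 1, "s2": 1, "s3": 2},
}
correct = {
    "obj_a": {"s1": 1, "s2": 1},
    "obj_b": {"s1": 1, "s3": 1},
}
mr = mean_recall(correct, ground_truth)
```

Here the first object is always found (recall 1.0) and the second is found in two of four instances (recall 0.5), so averaging per object prevents frequently occurring objects from dominating the score.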

Algorithms for comparison
We compare our method with several baselines using only depth images as input: Drost-PPF [20] and Buch-17 [33]. We choose the commercial machine vision software MVTec HALCON as the implementation of the original PPF and its optimized algorithm. The open-source method Buch-17 [33] is a 3D object recognition method. It uses various three-dimensional local feature descriptors to find point pair correspondences that are constrained to vote in a 1-DoF rotation subgroup of the full pose space, SE(3). Kernel density estimation allows for an efficient combination of votes to determine the resulting pose. The method relies on three-dimensional local feature descriptors and is evaluated with several of them: ECSAD [34], NDHist [35], SI [7], SHOT [36], FPFH [8], and PPF [20].

Parameter analysis
In this subsection, we use the spine dataset for parametric analysis. To analyze each parameter, we vary one parameter at a time while keeping the others fixed. If a parameter does not yet have a determined value, we use its default value. We mainly analyze the following four parameters: the quantization step of distance $\Delta_{dist}$, the quantization step of angle $\Delta_{angle}$, the number of poses N passed to the pose verification function, and the size s of the AABB box. $\Delta_{dist}$ is defined relative to the diameter of the model. As shown in Fig. 11, the best performance is obtained with $\Delta_{angle} = 5$ and $\Delta_{dist} = 0.02$. The higher the number of selected poses, the higher the correct rate, but considering the time consumption, we set N = 9.
The axis-aligned bounding box (AABB) is the ROI used by the pose verification function for each candidate pose. The larger the AABB box, the more points around the pose are considered, which makes it easier to filter out poses that only partially match the spine. We aim to determine the correctness of a pose from the matching degree of the points in the AABB box, but when the AABB box exceeds a certain size, the accuracy becomes susceptible to outliers and tends to decrease. We therefore choose an AABB box size of 140%.

Quality and robustness
In this subsection, we test the robustness of our method to Gaussian noise using the real bone dataset and the public UWA dataset. We add Gaussian noise with different standard deviations to the point coordinates. The standard deviations are 0.0, 0.5, 1.0, 1.5, and 2.0 mm. Table 1 shows the robustness of our method. The performance decreases slightly as the noise level increases, but our method still performs well on the noisy data.
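The noise perturbation used in this robustness test can be reproduced with a few lines of NumPy. The function name, the fixed seed, and the placeholder point cloud are assumptions for illustration; coordinates are taken to be in millimetres.

```python
import numpy as np

def add_gaussian_noise(points, sigma_mm, seed=0):
    """Return a copy of `points` with zero-mean Gaussian noise added per coordinate."""
    rng = np.random.default_rng(seed)
    return points + rng.normal(0.0, sigma_mm, size=points.shape)

points = np.zeros((10000, 3))  # placeholder cloud; a real scene would be loaded here
noisy = {sigma: add_gaussian_noise(points, sigma)
         for sigma in (0.0, 0.5, 1.0, 1.5, 2.0)}
```

Each entry of `noisy` would then be passed through the full pose estimation pipeline, and the recognition results compared across noise levels.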

Effect of sampling on performance
To clearly describe the contribution of the sampling in our method to the final result, we compare it with the sampling method of [27], which does not emphasize the edge points. To ensure that the number of points sampled by the edge-focused method is smaller than or equal to that of the compared method, we perform an additional sampling step for non-edge points. As shown in Table 2, sampling that focuses more on edge points yields a higher recall, which we attribute to the fact that stable features are more often present on the contours of the object. This indicates that increasing the number of edge points sampled can improve the matching results.

Effect of pose verification function on performance
In this subsection, we compare our edge-based post-processing method with the pose verification method in [29]. In [29], a pose is scored based on the overlap of surfaces, and those model points that are close to the scene vote to indicate support for the pose hypothesis. As shown in Table 3, our edge-based post-processing approach is more discriminative. The edge information can robustly describe the geometric contour of the object. Within the ROI, the higher the matching degree of edge points, the higher the probability that the pose is correct.

Effect of Early exit strategy on performance
In this subsection, we focus on the time efficiency of our pose verification function, and we compare three ways of using it. In the first way, the poses are not classified before being passed to the post-processing. The second way is as described in Section 4.3.4, but without the early exit strategy. The third way is the method of this paper, which uses the early exit strategy when the threshold is exceeded. As shown in Fig. 12, the third way has the shortest time consumption. Our pose classification and early exit strategy yield a clear improvement in efficiency. The reason why pose classification reduces time consumption is that poses with larger scores are more likely to be the correct pose. Therefore, processing this small category of high-likelihood poses first can reduce the time significantly.

Recognition results on the spine dataset
As shown in Table 4, our algorithm outperforms the other competitors in terms of correctness. In terms of time cost, the commercial software HALCON is the fastest because it makes full use of the hardware and is also fully optimized at each step. Compared with [33], our method is faster than most of the 3D descriptor algorithms. Our algorithm can subsequently be further accelerated at each step on the GPU for surgical navigation applications. Fig. 14 shows a qualitative comparison of these methods for several scenes.

Results of navigation and positioning
In order to verify the effectiveness of the robot control method, we verify the feasibility of the scheme in the simulation environment. Fig. 17 shows the visualization interface, which is simulated in CoppeliaSim.
In the simulation environment, the camera intrinsics, hand-eye calibration parameters, and tool calibration parameters can be directly calculated. However, in a real scene, these parameters can only be obtained by calibration, and the calibration process introduces errors, so they cannot be determined exactly. To simulate the real situation, we add noise to these parameters. Based on experience in real scenarios, we add Gaussian noise of σ = 5 to $f_x$, $f_y$ and σ = 1 to $c_x$, $c_y$ in the camera intrinsics. Gaussian noise of σ = 0.01 is added to the rotation and translation vectors of the calibration parameters. Under this setting, the robot arm performs a movement of two seconds at a time. During the simulation, the motion trajectory of the camera's optical center (in Fig. 18
Fig. 13 shows the qualitative experimental results in the real environment. The left is a pose diagram of the prescribed drill, and the right is the resulting pose of the robotic arm.

Recognition results on the public dataset
To demonstrate not only the high recognition rate of our algorithm for complex and symmetric objects (e.g., the spine) but also its effectiveness for objects of other shapes, we tested it on the public datasets UWA and DTU. Table 6 shows the recognition results of our algorithm and the other seven algorithms on the UWA dataset. In terms of time consumption, our algorithm is faster than all the other algorithms except the commercial software HALCON. In terms of recognition accuracy, we achieve a 100% recognition rate for most objects, surpassing the other compared algorithms even in the highly occluded case. The qualitative comparison on the UWA dataset in Fig. 15 shows that our algorithm still produces stable and correct results under high occlusion.
The DTU dataset contains models with many different types of geometric structure. To show the effect of our algorithm on different geometric structures more clearly, we manually divided the DTU dataset into three groups by geometric structure: geometrically complex, planar, and cylindrical. The geometric classification of the DTU dataset is given in the Appendix.
We selected some complex and symmetric objects with bone-like properties from the DTU dataset. The quantitative comparison of the eight algorithms is shown in Table 5, which shows the clear advantage of our algorithm for this type of object.
We compare our algorithm with the other algorithms for the different geometric structures in the DTU dataset. The final results are shown in Table 7; our algorithm outperforms the other matching algorithms across these geometric structures.

Fig. 20 The geometric classification of the DTU dataset (panel labels: Cylindrical, Planar).

Fig. 1 Our experiment for robot-operated positioning with vision-based navigation. (a) The depth camera scans the spine for template-based pose estimation. (b) After matching, the robotic arm points at and drills the spine with a predetermined pose and position.

Fig. 3 The framework of the proposed method. It is divided into two stages: offline training and online matching. In the offline training stage, the CAD model is input (a). After down-sampling (b), the PPF features are extracted from the model (c). In the filtering step (d), we discard PPF features whose point-pair normal vectors form an angle higher than 175° or lower than 5°. The remaining PPF features are stored in a hash table (e). In the online matching stage, the scene point cloud is input (f). In the pre-processing of the scene point cloud (g), we use a clustered down-sampling method that takes the normal vector information into account and focuses on the edge points and the points with large curvature. The PPF features extracted from the scene point cloud (c) are matched against the hash table, and the candidate poses are generated by voting and pose clustering (h). Each candidate pose is then post-processed (i). The pose with the highest matching score is selected by an improved edge-based pose verification method. Finally, we use ICP to refine the resulting pose.
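The offline steps (c) to (e) can be illustrated with a short Python sketch. It uses the standard four-component PPF definition (point distance, the two normal-to-difference angles, and the angle between the normals) and a brute-force loop over all ordered point pairs. The function names, the dictionary-based hash table, and the example value of `d_dist` are assumptions for illustration, not the paper's implementation; only the 5° and 175° filter bounds come from the caption.

```python
from collections import defaultdict

import numpy as np


def angle_deg(a, b):
    """Angle between two vectors in degrees, clipped for numerical safety."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))


def build_ppf_table(points, normals, d_dist, d_angle=5.0, lo=5.0, hi=175.0):
    """Hash quantized PPFs of all ordered point pairs, skipping filtered pairs."""
    points = np.asarray(points, dtype=float)
    normals = np.asarray(normals, dtype=float)
    table = defaultdict(list)
    for i in range(len(points)):
        for j in range(len(points)):
            if i == j:
                continue
            d = points[j] - points[i]
            dist = np.linalg.norm(d)
            if dist == 0.0:
                continue  # coincident points carry no direction
            a_nn = angle_deg(normals[i], normals[j])
            if a_nn < lo or a_nn > hi:
                continue  # step (d): near-parallel or anti-parallel normals
            feature = (dist / d_dist,
                       angle_deg(normals[i], d) / d_angle,
                       angle_deg(normals[j], d) / d_angle,
                       a_nn / d_angle)
            key = tuple(int(v) for v in feature)  # quantize to integer bins
            table[key].append((i, j))             # step (e): store the pair
    return table


# Toy model: points 0 and 1 share a normal, so their pair is filtered out.
pts = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
nrm = [[0, 0, 1], [0, 0, 1], [1, 0, 0]]
table = build_ppf_table(pts, nrm, d_dist=0.1)
```

In the toy model only the four ordered pairs involving point 2 are stored, since the normals of points 0 and 1 are parallel.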

Fig. 5 The flow chart of the clustered down-sampling method considering edge information.

Fig. 6 Classification of edge points in ROI.

Fig. 7 Hardware composition of our experiment. The left is the schematic diagram, and the right is the physical setup.

Fig. 9 Several object models and two random scenes in the open datasets: (a) UWA dataset; (b) DTU dataset.

Fig. 11 Parameter analysis for the spine dataset. The default values of these parameters are: quantization step of distance ∆dist = 0.025, quantization step of angle ∆angle = 5, number of poses passed to the pose verification function N = 10, and size of the AABB box s = 150%.

Fig. 12 Comparison of the time efficiency of three ways of using the pose verification function on the UWA and spine datasets.
The visual feature errors (Fig. 18(b)) and camera velocities (Fig. 18(c)) recorded in the simulation show that the closer the drill is to the target pose, the lower the speed of the robot arm. The computed tip distance error is within 1 mm and the angle error is within 1°.
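One common way to compute such errors from a reached and a target drill pose is sketched below. The text does not define the two metrics, so this sketch makes two assumptions: the tip is the origin of the tool frame, and the angle error is the geodesic angle of the relative rotation. The function name and the 4x4 homogeneous-matrix convention are likewise illustrative.

```python
import numpy as np


def pose_errors(T_est, T_ref):
    """Return (tip distance error, rotation angle error in degrees).

    T_est, T_ref: 4x4 homogeneous transforms of the tool frame.
    The distance is in the same length unit as the translations.
    """
    T_est = np.asarray(T_est, dtype=float)
    T_ref = np.asarray(T_ref, dtype=float)
    tip_err = np.linalg.norm(T_est[:3, 3] - T_ref[:3, 3])
    R_rel = T_ref[:3, :3].T @ T_est[:3, :3]  # rotation from target to reached pose
    cos = (np.trace(R_rel) - 1.0) / 2.0
    ang_err = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return tip_err, ang_err


# Example: 0.5 degree rotation about z and a 0.5 mm offset (made-up values).
theta = np.radians(0.5)
T_est = np.eye(4)
T_est[:3, :3] = [[np.cos(theta), -np.sin(theta), 0.0],
                 [np.sin(theta), np.cos(theta), 0.0],
                 [0.0, 0.0, 1.0]]
T_est[:3, 3] = [0.3, 0.4, 0.0]
tip_err, ang_err = pose_errors(T_est, np.eye(4))
```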

Fig. 18 Experimental results of the simulation. (a) The motion trajectory of the camera's optical center in Cartesian space. (b) Visual feature error. (c) Camera velocities.

Table 1
Results of our algorithm under interference from various types of noise.

Table 2
Validation of edge-based sampling method.

Table 3
Validation of our pose verification method.

Table 4
Comparison of eight algorithms on the spine dataset.