We describe here our contributions to make PPFs more discriminative, and more robust to background clutter and sensor noise. We evaluate the improvement provided by each of these contributions, and compare their combinations against state-of-the-art methods in the next section.
4.1 Pre-processing of the 3D Models and the Input Scene
During a pre-processing stage, Drost-PPF subsamples the 3D points of the target objects and the input scene. The advantage is two-fold: it speeds up subsequent computations, and it avoids considering too many ambiguous point pairs, since points that are close to each other tend to have similar normals and therefore generate many non-discriminative PPFs. Drost-PPF thus subsamples the points so that any two retained 3D points are at least a chosen minimal distance apart.
This, however, can lead to a loss of useful information when the normals of nearby points are actually different. We therefore also keep point pairs whose distance is smaller than the minimal distance if the angle between their normals is larger than 30 degrees, as such pairs are likely to be discriminative. Subsampling is otherwise done as in Drost-PPF, but with this additional constraint.
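As an illustration, here is a minimal sketch of this normal-aware subsampling in its greedy quadratic form (the function name, the greedy order, and the exact acceptance test are our own choices; a voxel grid over the kept points would make the inner loop faster):

```python
import numpy as np

def subsample(points, normals, min_dist, angle_thresh_deg=30.0):
    """Greedy subsampling: drop a point only if it is both closer than
    min_dist to an already kept point AND has a similar normal."""
    cos_thresh = np.cos(np.deg2rad(angle_thresh_deg))
    kept_pts, kept_nrm = [], []
    for p, n in zip(points, normals):
        keep = True
        for q, m in zip(kept_pts, kept_nrm):
            # Close points whose normals differ by more than the threshold
            # are still kept, as such pairs are likely discriminative.
            if np.linalg.norm(p - q) < min_dist and np.dot(n, m) > cos_thresh:
                keep = False
                break
        if keep:
            kept_pts.append(p)
            kept_nrm.append(n)
    return np.asarray(kept_pts), np.asarray(kept_nrm)
```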
4.2 Smart Sampling of Point Pairs
After subsampling, Drost-PPF pairs every scene point with every other scene point at run-time. The complexity is therefore quadratic in the number of points in the 3D scene. To reduce computation time, [8] suggests using only every m-th scene point as the first point of a pair, where m is often set to 5 in practice. While this improves the run-time, the complexity remains quadratic, and matching performance suffers because information is removed from the already subsampled scene point cloud.
We propose a better way to speed up the computations without discarding scene points: a first point from the scene should only be paired with other scene points that can belong to the same object. For example, if the distance between the two points is larger than the size of the object, these two points cannot possibly belong to the same object and therefore should not be paired. We show below that this leads to a method that can be implemented much more efficiently.
A conservative way to do so is to ignore any point that is farther away than \(d_\text{obj}\) from the first point of a pair, where \(d_\text{obj}\) is the diameter of the enclosing sphere of the target object; this defines a voting ball.
However, a spherical region can be a very bad approximation for some objects. In particular, with narrow elongated objects, sampling from a sphere with radius \(d_\text {obj}\) will generate many points on the background clutter if the object is observed in a viewing direction parallel to its longest dimension, as depicted in Fig. 2.
In these cases, we would like to use a smaller sampling volume in which the ratio of scene points lying on the object to all other scene points is larger. However, we do not have any prior information on the pose of the object, and the first scene point of a pair can lie anywhere on the object. It is therefore impossible to define a single volume smaller than the ball of radius \(d_\text{obj}\) without, for certain object configurations, discarding pairs of scene points that both lie on the target object.
We therefore opted for consecutively using two voting balls with different radii: a small one with radius \(R_{\min} = \sqrt{d_{\min}^2 + d_\text{med}^2}\), where \(d_{\min}\) is the smallest dimension of the object's bounding box and \(d_\text{med}\) is the median of its three dimensions, and the large conservative one with radius \(R_{\max} = d_\text{obj}\). It is easy to see that \(R_{\min}\) is the smallest observable expansion of the object. We say that a point pair is accepted by a voting ball if the first point is at the center of the ball and its distance to the second point is smaller than the radius of the ball.
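A short sketch of how the two radii could be computed, assuming \(d_\text{obj}\) is taken as the bounding-box diagonal, a common approximation of the enclosing-sphere diameter:

```python
import numpy as np

def voting_ball_radii(bbox_dims):
    """Radii of the two voting balls from the object's bounding-box
    dimensions; d_obj is approximated by the bounding-box diagonal."""
    d_min, d_med, d_max = sorted(bbox_dims)
    r_min = np.sqrt(d_min**2 + d_med**2)              # smallest observable expansion
    r_max = np.sqrt(d_min**2 + d_med**2 + d_max**2)   # d_obj, conservative radius
    return r_min, r_max
```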
We first populate the accumulator with votes from the pairs accepted by the small ball. We extract the peaks from the accumulator, each of which corresponds to a hypothesis on the object's 3D pose and a correspondence between a model point and a scene point, as in Drost-PPF. We then continue populating the accumulator with votes from the pairs that are accepted by the large ball but were rejected by the small one, and again extract the peaks to generate pose and point correspondence hypotheses. This way, under poses such as the one illustrated by Fig. 2, we get peaks that are less polluted by background clutter during the first pass, and still get peaks for the other configurations during the second pass.
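The acceptance rule behind the two passes can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def split_pairs_by_ball(ref_point, scene_points, r_min, r_max):
    """Partition candidate second points into those accepted by the small
    ball (first pass) and those accepted only by the large ball (second
    pass); the reference point itself is excluded."""
    dists = np.linalg.norm(scene_points - ref_point, axis=1)
    first_pass = scene_points[(dists > 0) & (dists <= r_min)]
    second_pass = scene_points[(dists > r_min) & (dists <= r_max)]
    return first_pass, second_pass
```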
To efficiently look for the pairs accepted by a ball of radius d, we use a spatial lookup table: for a given scene, we build a voxel grid filled with the scene points. The number of voxels in each dimension is adapted to the scene points and can differ along the x, y, and z dimensions. Each voxel in this grid has size d and stores the indices of the scene points that lie in it; reciprocally, for each scene point we also store the index of the voxel it belongs to. Building up this voxel grid is an O(n) operation. To extract, for a given first point, all other scene points that are at most d away, we look up the voxel the scene reference point belongs to and collect all the scene point indices stored in this voxel and in all adjacent voxels. We then check, for each of these scene points, whether its distance to the scene reference point is actually smaller than or equal to d.
The complexity of this method is therefore O(nk), where k is the number of points retrieved per lookup and is usually at least one order of magnitude smaller than n, compared to the quadratic complexity of Drost-PPF, while guaranteeing that all relevant point pairs are considered.
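The lookup structure could be sketched as follows; for simplicity it hashes voxel coordinates into a map instead of allocating a dense grid adapted to the scene extent, which is an implementation choice of ours:

```python
import numpy as np
from collections import defaultdict

def build_voxel_grid(points, d):
    """Hash every point index into a voxel of edge length d: O(n)."""
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[tuple(np.floor(p / d).astype(int))].append(i)
    return grid

def neighbors_within(points, grid, ref_idx, d):
    """Return the indices of all points at most d away from points[ref_idx]
    by scanning the reference voxel and its 26 adjacent voxels, then
    verifying the actual Euclidean distance."""
    ref = points[ref_idx]
    cx, cy, cz = np.floor(ref / d).astype(int)
    result = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                for i in grid.get((cx + dx, cy + dy, cz + dz), []):
                    if i != ref_idx and np.linalg.norm(points[i] - ref) <= d:
                        result.append(i)
    return result
```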
4.3 Accounting for Sensor Noise When Voting
For fast access, the PPFs are discretized. However, sensor noise can change the discretization bin, preventing some PPFs from being correctly matched. We overcome this problem by spreading the content of the look-up table during the pre-processing of the model: instead of storing the first model point and the rotation angles only in the bin indexed by the discretized PPF vector, we also store them in the 80 neighboring bins indexed by the adjacent discretized PPFs (there are \(3^4 - 1 = 80\) such bins).
We face a similar problem at run-time when voting for the quantized rotation angles around the point normals. To overcome it, we use the same strategy as above and vote not only for the original quantized rotation angle but also for its adjacent neighbors.
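One possible way to enumerate the bins touched by spreading, assuming the PPF is quantized as a distance bin and three angle bins with the bin counts given at the end of this section, and clipping at the range borders (wrap-around for the angle bins may be preferable, depending on their parameterization):

```python
from itertools import product

def spread_ppf(ppf_bins, n_dist_bins=40, n_angle_bins=22):
    """Enumerate a discretized PPF (dist_bin, a1, a2, a3) together with its
    adjacent bins, i.e. 3^4 = 81 combinations including the original,
    clipping at the borders of the quantization ranges."""
    ranges = (n_dist_bins, n_angle_bins, n_angle_bins, n_angle_bins)
    out = set()
    for offsets in product((-1, 0, 1), repeat=4):
        candidate = tuple(b + o for b, o in zip(ppf_bins, offsets))
        if all(0 <= c < r for c, r in zip(candidate, ranges)):
            out.add(candidate)
    return out
```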
However, as shown in Fig. 3, spreading also has a drawback: Because of discretization and spreading in the feature+rotation space, it is very likely that pairs made of close scene points have the same quantized rotation angle and are mapped to the same look-up table bin. They will thus vote for the same bin in the accumulator space, introducing a bias in the votes.
A direct method to avoid multiple votes would be to use a 3D binary array \((a_{i,j,k})\) for each scene point, to flag whether a vote with the i-th and j-th model points as first and second point, respectively, and the k-th quantized rotation angle around the normal has already been cast, and to prevent additional votes for this combination.
Unfortunately, such an array would be very large, as its size is quadratic in the number of model points, and we would need to create one for each scene point. We propose a much more tractable solution.
Instead of indexing a flag array by the pair of model points and the corresponding rotation, we use an array b indexed by the quantized PPFs the point pairs generate. Each element of b is a 32-bit integer initialized to 0, in which each bit corresponds to a discretized rotation angle \(\alpha^{scene}\). We simply set the bit corresponding to a quantized PPF and scene rotation to 1 the first time it votes; this prevents further votes for the same combination of quantized PPF and scene rotation, even if it is generated again by discretization and spreading. Spreading is achieved by generating, for each discretized PPF created at run-time from the scene points, the adjacent discretized PPFs, and treating them like the original discretized PPFs.
Note that we use here the scene rotation angle around the normal, since it depends only on the scene point pair and not on the model points, as shown in [8]. It is thus the same for all elements stored in one bin of the look-up table, which allows us to use it for flagging in b.
This solution is more efficient than the direct method discussed above, as the number of possible entries is smaller in practice thanks to the quantization of the PPFs: the size of b is constant for all objects and scales linearly with the number of possible quantized PPFs. In practice, we quantize each angle and the distance of Eq. (1) into 22 and 40 bins respectively, yielding \(22^3 \times 40\) possible quantized PPFs. In our implementation, the direct method becomes significantly slower once the number of model points exceeds 650, with a typical slow-down factor of 3 for 1000 model points.
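A sketch of this flagging scheme, assuming at most 32 quantized rotation angles so that all flags of one PPF bin fit a single 32-bit integer:

```python
import numpy as np

N_ANGLE, N_DIST = 22, 40   # quantization of the three angles and the distance

# One 32-bit integer per quantized PPF (22^3 * 40 = 425,920 entries, ~1.7 MB);
# bit k flags that the k-th quantized scene rotation angle has already voted.
# b must be cleared for every new scene reference point.
b = np.zeros(N_ANGLE**3 * N_DIST, dtype=np.uint32)

def ppf_index(dist_bin, a1, a2, a3):
    """Flatten a 4D quantized PPF into a single index into b."""
    return ((dist_bin * N_ANGLE + a1) * N_ANGLE + a2) * N_ANGLE + a3

def try_vote(idx, alpha_scene_bin):
    """Return True and set the flag on the first occurrence of this
    (quantized PPF, scene rotation) combination; False on duplicates."""
    mask = np.uint32(1) << np.uint32(alpha_scene_bin)
    if b[idx] & mask:
        return False
    b[idx] |= mask
    return True
```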
4.4 Generating Object Pose and Postprocessing
To extract object poses from the accumulator, Drost-PPF uses a greedy clustering approach: the peaks are extracted from the accumulator, each corresponding to a hypothesis on the object's 3D pose and a correspondence between a model point and a scene point, and processed in decreasing order of their vote counts; each hypothesis is assigned to the closest cluster if it is close enough, otherwise a new cluster is created.
We found that this method is not always reliable, especially in the case of noisy sensors and background clutter: these result in spurious votes, and the number of votes in the accumulator space is then not necessarily a reliable indicator of the quality of a hypothesis.
We therefore propose a different clustering strategy that takes our voting strategy into account. We perform a bottom-up clustering of the pose hypotheses generated during voting with one voting ball. We allow a hypothesis to join several clusters as long as its pose is similar to that of the cluster center. We also keep track of the model points associated with each hypothesis, and only allow a hypothesis to vote for a cluster if no other hypothesis with the same model point has voted for this cluster before. This prevents ambiguous and repetitive geometric structures, such as planar surfaces, from introducing biases.
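A sketch of this bottom-up clustering; the pose-similarity test below (translation and rotation-angle thresholds) is our own placeholder, as the thresholds are not specified here:

```python
import numpy as np

def cluster_hypotheses(hyps, t_thresh, r_thresh):
    """Bottom-up clustering of hypotheses ((R, t), model_point_id, votes).
    A hypothesis may join every cluster with a similar center pose, but
    only if no hypothesis with the same model point joined that cluster
    before."""
    clusters = []
    for (R, t), model_pt, votes in hyps:
        joined = False
        for c in clusters:
            Rc, tc = c['center']
            # Rotation difference measured as the angle of R * Rc^T.
            cos_a = np.clip((np.trace(R @ Rc.T) - 1.0) / 2.0, -1.0, 1.0)
            if (np.linalg.norm(t - tc) < t_thresh
                    and np.arccos(cos_a) < r_thresh
                    and model_pt not in c['model_pts']):
                c['weight'] += votes
                c['model_pts'].add(model_pt)
                joined = True
        if not joined:
            clusters.append({'center': (R, t), 'weight': votes,
                             'model_pts': {model_pt}})
    return sorted(clusters, key=lambda c: c['weight'], reverse=True)
```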
For each of the first few clusters with the largest weights, we refine the estimated pose using projective ICP [10]. In practice, we consider the first four clusters for each of the two voting balls.
To reject the clusters that do not actually correspond to the object, we render the object according to the corresponding 3D pose and count how many pixels have a depth close to that of the rendering, how many are farther away from the camera (and could thus be occluded), and how many are closer (and are therefore not consistent with the rendering). If the number of pixels that are closer is too large compared to the total number of pixels, we reject the cluster.
In practice, this threshold is set to 10%. We also discard objects that are too heavily occluded. As a last check, we compute areas with significant depth or normal change in the scene and compare them to the silhouette of the projected object: if the silhouette is not sufficiently covered by depth or normal change, we discard the match. In practice, we use the same threshold as for the occlusion check. We finally rank all remaining clusters according to how well they fit the scene points and return only the pose of the best one or, in the case of multi-instance detection, the whole list of poses from all remaining clusters.
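The depth-consistency test could look as follows, assuming background pixels of the rendering are marked as NaN and using the 10% threshold mentioned above:

```python
import numpy as np

def pose_is_consistent(rendered_depth, scene_depth, tol, max_closer_ratio=0.10):
    """Accept a pose if at most max_closer_ratio of the rendered pixels
    see the scene in front of the object; scene points farther away than
    the rendering are treated as possible occluders and not penalized."""
    mask = np.isfinite(rendered_depth)             # pixels covered by the object
    diff = scene_depth[mask] - rendered_depth[mask]
    n_closer = np.count_nonzero(diff < -tol)       # scene closer than rendering
    return n_closer <= max_closer_ratio * max(mask.sum(), 1)
```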