SoftPool++: An Encoder–Decoder Network for Point Cloud Completion

We propose a novel convolutional operator for the task of point cloud completion. One striking characteristic of our approach is that, in contrast to related work, it does not require any max-pooling or voxelization operation. Instead, the proposed operator used to learn the point cloud embedding in the encoder extracts permutation-invariant features from the point cloud via a soft-pooling of feature activations, which are able to preserve fine-grained geometric details. These features are then passed on to a decoder architecture. Due to the compression in the encoder, a typical limitation of this type of architecture is that it tends to lose parts of the input shape structure. We propose to overcome this limitation by using skip connections specifically devised for point clouds, where links between corresponding layers in the encoder and the decoder are established. As part of these connections, we introduce a transformation matrix that projects the features from the encoder to the decoder and vice versa. The quantitative and qualitative results on the task of object completion from partial scans on the ShapeNet dataset show that incorporating our approach achieves state-of-the-art performance in shape completion both at low and high resolutions.


Introduction
Several data representations exist for 3D shapes. One common choice is the use of spatially discretized representations such as volumetric data (Yang et al. 2017; Wang et al. 2019b; Yang et al. 2018a). Alternative popular choices are implicit descriptions (Park et al. 2018; Chibane et al. 2020) as well as sparse 3D coordinate-based representations such as point clouds (Yang et al. 2018b; Xie et al. 2020b; Yuan et al. 2018) and 3D meshes (Groueix et al. 2018). Among this latter category of 3D data formats, point clouds are arguably the simplest, since they store 3D coordinates without any
Communicated by Akihiro Sugimoto.
This paper focuses on the point cloud completion task. The goal is to fill out occluded parts of the input 3D geometry represented by a partial scan, in a way that is coherent with the global shape while preserving fine local surface details. This is a useful task for many real-world applications since occluded regions are normally present as part of most 3D data capture processes within, e.g., SLAM or multi-view reconstruction pipelines. State-of-the-art approaches targeting this task are based on neural networks and mostly rely on learning how to deform a set of 2D grids at different scales into 3D points, based on global shape descriptors typically represented by PointNet (Qi et al. 2017a) features. Examples of these approaches are FoldingNet (Yang et al. 2018b), AtlasNet (Groueix et al. 2018) and PCN (Yuan et al. 2018).
To overcome the aforementioned problem related to information loss due to feature compression at the level of the encoder-decoder bottleneck, GRNet (Xie et al. 2020b) proposes preserving fine geometry details by discretizing the features via volumetric feature maps used at the different layers of the encoder. It also suggests using a volumetric U-Net (Yang et al. 2018a) to build skip connections between the encoder and the decoder, eventually merging the obtained features with the input point cloud. The idea of leveraging skip connections among different layers of an encoder-decoder model follows the successful paradigm already exploited for volumetric shape completion, in particular 3D-RecGAN (Yang et al. 2017) and ForkNet (Wang et al. 2019b). While effective, converting sparse point cloud features into volumetric maps brings in all the disadvantages of discretized 3D data representations with respect to point clouds, in particular the loss of fine shape details, the inability to flexibly deal with local point density variations, as well as the impractical trade-off between 3D resolution and memory occupancy.
Recently, we have demonstrated that sorting features based on their activations, rather than applying max-pooling, builds point cloud embeddings that store more informative features than PointNet. This feature-learning approach, named SoftPool (Wang et al. 2020b), obtained state-of-the-art results for different point cloud-related tasks, such as completion and classification. In this work, we build on our previous work (Wang et al. 2020b) to propose a more complete end-to-end framework. Our contributions are twofold:
1. We generalize our feature extraction technique into a module called SoftPool++. This module introduces truncated SoftPool features that decrease the memory requirements of the original method during training, making it compatible with off-the-shelf GPUs. Notably, a disadvantage of the SoftPool features (Wang et al. 2020b) is that each point is processed independently from the rest. For this reason, the proposed module further processes the truncated SoftPool features with regional convolutions in order to recognize the relationships between the feature points. In contrast to Wang et al. (2020b), which applies the feature once, this module can be applied multiple times, as demonstrated in our architecture, which uses it across multiple layers.
2. We propose a novel encoder-decoder architecture characterized by the use of point-wise skip connections. Connecting corresponding layers between encoder and decoder has the advantage of preserving fine geometric details from the given partial input cloud. This is, to the best of our knowledge, the first approach using skip connections for unorganized sets of 3D feature maps, relaxing the need for spatial discretization as deployed in Xie et al. (2020b), with benefits in terms of completion accuracy and memory occupancy.
In addition, we also adapt the discriminator from TreeGAN (Shu et al. 2019) for the shape completion problem to further improve our model.
Our method is evaluated on ShapeNet (Chang et al. 2015) for the task of shape completion and on ModelNet (Zhirong et al. 2015) and PartNet (Mo et al. 2019) for the task of classification. Figure 1 illustrates a teaser of the shape completion results. It compares the architectures that are built on PointNet (Qi et al. 2017a) and SoftPool (Wang et al. 2020b) features. Visually, we show the advantage of the reconstructions that rely on SoftPool features, as they are remarkably more similar to the ground truth. Moreover, the figure also highlights the improvements of SoftPool++ with respect to our previous approach (Wang et al. 2020b).

Related Work
Based on the focus of our contributions, we browse through the relevant methods in 3D object completion from partial scans and the use of skip connections with 3D data.

3D Object Completion
Inspired by the way humans perceive the 3D world from 2D projections, 3D-R2N2 (Choy et al. 2016) builds recurrent neural networks (RNNs) to sequentially fuse multiple feature maps extracted from input RGB images and recover the 3D geometries. To further improve the reconstruction, a coarse-to-fine 3D decoder was presented in Pix2Vox (Xie et al. 2019) as well as the residual refiner in Pix2Vox++ (Xie et al. 2020a). Due to the recent popularity of attention mechanisms, AttSets (Yang et al. 2020) proposed to build attention layers to correlate the image features from different views. In contrast, our 3D reconstruction in this paper focuses on only a single depth image. Taking a depth image of an object from an arbitrary camera pose, the objective of 3D object completion is to complete its missing structure and build its full reconstruction. Focusing on learning-based completion, most related work can be categorized depending on the input data they process: voxelized grid or point cloud. Interestingly, a notable work from OcCo (Wang et al. 2021a) demonstrates that the weights trained for completion are also valuable for other tasks like segmentation and classification.

Voxelized Grid
Due to the popularity of 2D convolution operations in CNNs (Azad et al. 2019; Kirillov et al. 2020; Yang et al. 2020) for RGB images, their straightforward extension to 3D convolutions on volumetric data also became popular. 3D-EPN (Dai et al. 2017) and 3D-RecGAN (Yang et al. 2017) are the first works on this topic, where they extended the typical encoder-decoder architecture (Noh et al. 2015) to 3D. Adopting a similar architecture, 3D-RecGAN++ (Yang et al. 2018a) and ForkNet (Wang et al. 2019b) utilize adversarial training with a 3D discriminator to improve the reconstruction.
The main advantage of volumetric completion is the structure of its data, such that deep learning methods developed for RGB images can be extended to 3D. However, this advantage is also its limitation. The fixed local resolution makes it hard to reconstruct the object's finer details without consuming a huge amount of memory.

Point Cloud
Facing the opposite problem, point clouds have the potential to reconstruct the object at a higher resolution but have so far seen limited application in deep learning due to their unstructured data. Note that, unlike RGB images or voxel maps, point clouds do not have a particular order, and the number of points varies as we change the camera pose or the object.
To handle the unordered structure of point clouds, PointNet (Qi et al. 2017a) proposes to implement max-pooling in order to achieve a permutation-invariant latent feature. Based on this one-dimensional feature, FoldingNet (Yang et al. 2018b) proposes an object completion solution that deforms a 2D rectangular grid with a multi-layer perceptron (MLP). By increasing the number of 2D rectangular grids, AtlasNet (Groueix et al. 2018) and PCN (Yuan et al. 2018) added more complexity as well as details into the reconstruction. MSN (Liu et al. 2020) then further improves the completion by adding restrictions to separate different patches from each other. Moreover, Cycle4Completion (Wen et al. 2021) is also based on PointNet features but solves the problem by training with an unsupervised cycle transformation. Moving away from the global feature representation, PointNet++ (Qi et al. 2017b) samples local subsets of points with farthest point sampling (FPS), then feeds them into PointNet (Qi et al. 2017a). Based on this feature, PMP-Net (Wen et al. 2020b) completes the entire object gradually from the observed regions to the nearest occluded regions. SnowflakeNet (Xiang et al. 2021) also uses the PointNet++ features to split points in the coarsely reconstructed object and execute the completion progressively. In addition, building a feature similar to PointNet, ME-PCN (Gong et al. 2021) takes both the occupied and the empty regions on the depth image as input for 3D completion, showing the advantage of masking the empty regions in completion.
Unlike the methods which depend on a vectorized global feature to solve the permutation-invariance problem, RFNet (Huang et al. 2021) and PointTr (Yu et al. 2021) produce several global features in their encoder. On one hand, RFNet (Huang et al. 2021) uses its features to complete the object in a recurrent way by concatenating the incomplete input and the predicted points level by level. On the other, PointTr (Yu et al. 2021) relies on transformers to produce a set of queries directly from the observed points with the help of positional coding. In effect, PointTr (Yu et al. 2021) does not need to compress the input into a single vector.
The recent works PVD (Zhou et al. 2013), GRNet (Xie et al. 2020b) and VE-PCN (Wang et al. 2021b) leverage both the point cloud and the voxel grid representations. Unlike most works that rely on the Chamfer distance to optimize the model, PVD (Zhou et al. 2013) uses a simple Euclidean loss to optimize the shape generation model from the voxelized point cloud representation. GRNet (Xie et al. 2020b) first voxelizes the point cloud, processes the voxel grid with deep learning and converts the results back to a point cloud. While this solves the unorganized structure of the point clouds, its discretization removes their advantage in reconstructing at higher resolutions. VE-PCN (Wang et al. 2021b) improves the completion by supplementing the features of the decoder in the volumetric completion with the edges. This method then converts the voxels to point clouds by Adaptive Instance Normalization (Lim et al. 2019).
Another solution is presented in our previous work SoftPoolNet (Wang et al. 2020b), which builds local groups of features by sorting them into a feature map. 2D convolutions are then applied to the feature map. Consequently, this approach is able to deal with unorganized point clouds and achieve reconstruction results at high resolution. We build upon SoftPoolNet (Wang et al. 2020b) and generalize the feature extraction into a module which we call SoftPool++. This then allows us to connect multiple modules in an encoder-decoder architecture. As a consequence, we achieve better quantitative and qualitative results.

Skip Connections in 3D
Skip connections were initially proposed for image processing (Mazaheri et al. 2019; Kim et al. 2016; Gao et al. 2019; Azad et al. 2019), then later adapted for 3D volumetric reconstruction (Yang et al. 2017, 2018a; Wang et al. 2019b). Given a point cloud as input, methods like GRNet (Xie et al. 2020b) and InterpConv (Mao et al. 2019) need to convert the input point cloud to voxel grids.
Aiming at alleviating this limitation on point clouds, the work from Std (Yang et al. 2019) bypasses the encoder features into the decoder point-by-point, while GACNet (Wang et al. 2019a) constructs a graph from the points, then constructs the skip connection with the graph. The problem of these point-wise skip connections is that new points cannot be introduced in the decoder. To solve this, SA-Net (Wen et al. 2020a) groups PointNet++ (Qi et al. 2017b) features in different resolutions with KNN. The skip connection from the encoder then matches the resolution of the decoder.
Contrary to these methods, in the context of object completion, the objective of our skip connection is to compensate for the lost data in the encoder and bypass the observed geometry to the decoder. We also introduce the concept of feature transformation to compensate for the difference between the features from the encoder and decoder. Later in our evaluation, we find that the skip connection is a crucial step to achieve higher accuracy. Moreover, the SoftPool++ features also contribute to making our skip connection simpler. Since it is an organized feature, we avoid the time-consuming KNN, which significantly decreases our inference time.

Feature Extraction
Given the partial scan of an object, the input to our network is a point cloud with $N_{in}$ points written in matrix form as $P_{in} = [x_i]_{i=1}^{N_{in}}$, where each point is represented as the 3D coordinates $x_i = [x_i, y_i, z_i]$. On one hand, the first objective of this section is to build a feature descriptor from the unorganized point cloud such that the feature remains the same for any permutation of the points in $P_{in}$. On the other hand, the second objective is to generalize this process into a feature extraction module that takes an arbitrary input $P_{in}$. In this way, the proposed module can be used at multiple instances in our architecture.

SoftPool Feature
From the point cloud vector, we then convert each point into a feature vector $f_i$ with $N_f$ elements by projecting every point with a point-wise multi-layer perceptron (Qi et al. 2017a) with its parameters assembled in $W_{MLP}$. Thus, we define the $N_{in} \times N_f$ feature matrix as $F = [f_i]_{i=1}^{N_{in}}$. Note that we apply a softmax function to the output neurons of the perceptron so that the elements in $f_i$ range between 0 and 1.
Throughout this section, we refer to the toy example in Fig. 2 to visualize the various steps. This example assumes that there are only five points in the point cloud such that $N_{in} = 5$, as shown in Fig. 2a.
One of the main challenges in processing a point cloud is its unstructured arrangement. If we look at Fig. 2a, changing the order of the points in $P_{in}$ reorganizes the rows of the feature map $F$. There is consequently no guarantee that the feature map remains constant for the same set of points. To solve this problem, we propose to organize the feature vectors in $F$ so that their $k$-th elements are sorted in descending order, which is denoted as $F'_k$. Note that $k$ should not be larger than $N_f$. This is demonstrated in Fig. 2a, where we arrange the five feature vectors from $F = [f_i]_{i=1}^{5}$ to $F'_k = [f_i]_{i=\{3,5,1,2,4\}}$ by comparing the $k$-th element of each vector.
The features in SoftPoolNet (Wang et al. 2020b) repeat this process for all of the $N_f$ elements in $f_i$. Altogether, the feature is a 3D tensor $F' = [F'_k]_{k=1}^{N_f}$ with the dimension of $N_{in} \times N_f \times N_f$, as shown in Fig. 2b. Finally, we assemble the SoftPool features $F^*$ by taking the $N_r$ rows with the highest activations of all $F'_k$ in $F'$. Since each row in $F'_k$ is equivalent to a point, we can then interpret the $N_r$ rows of $F'_k$ as one region in the point cloud, summing up to all $N_f$ regions in $F^*$.
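The sorting step can be illustrated with a minimal sketch in plain Python. The feature values below are made up so that sorting by the first element reproduces the ordering {3, 5, 1, 2, 4} of the toy example; the function name `sort_by_activation` is hypothetical and not taken from the authors' code, and the sketch assumes that the sorted activations are distinct.

```python
def sort_by_activation(features, k):
    """Return the rows of `features` sorted by their k-th element, descending."""
    return sorted(features, key=lambda f: f[k], reverse=True)

# Toy feature map F with N_in = 5 points and N_f = 3 activations per point.
# The values are illustrative; each row sums to 1, as after a softmax.
F = [
    [0.20, 0.50, 0.30],  # f_1
    [0.10, 0.60, 0.30],  # f_2
    [0.70, 0.20, 0.10],  # f_3
    [0.05, 0.05, 0.90],  # f_4
    [0.40, 0.35, 0.25],  # f_5
]

F_sorted = sort_by_activation(F, k=0)

# Feeding the same points in a different order yields the same sorted map.
F_permuted = [F[i] for i in (4, 2, 0, 3, 1)]
assert sort_by_activation(F_permuted, k=0) == F_sorted
```

Because the sort key depends only on the activation values, any permutation of the input rows produces the same sorted map, which is the invariance the text relies on.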
Although both PointNet (Qi et al. 2017a) and SoftPoolNet (Wang et al. 2020b) utilize an MLP in their architecture, they differ significantly in how they handle its results. Compared to the max-pooling operation in PointNet (Qi et al. 2017a), the motivation of the SoftPool feature is to capture a larger amount of information and to further process it with regional convolution operations, as explained later in Sect. 4.
Fig. 2 (b): concatenation of the first $N_r$ rows of $F'_k$ to construct the 3D tensor $F^*$, which corresponds to the regions with high activations, then truncated to assemble $\tilde{F}$.

Generalizing and Truncating the SoftPool Feature
In practice, we noticed that we can generalize the SoftPool feature formulation to an arbitrary input feature $P_{in}$ (thus relaxing its definition as a set of points) to produce the SoftPool features $F^*$. From this perspective, we can construct an architecture with a series of SoftPool feature extractions. Therefore, we take the point cloud as the input to the architecture and extract the first SoftPool features. Then, after processing the first features, we can extract the second features from them and so on. This is discussed later in Sect. 4 with an encoder-decoder architecture. However, the drawback of such an architecture is the size of the SoftPool features. With a dimension of $N_r \times N_f \times N_f$, the memory footprint increases with the size of the feature, but we are constrained by the memory size of our off-the-shelf GPU. Notably, Wang et al. (2020b) set the feature dimension $N_f$ to a small value of 8. In this work, since we are interested in building a series of SoftPool features in an encoder-decoder architecture, $N_f$ increases up to 256 in the latent space.
Hence, we propose to further truncate the SoftPool features to $N_r \times N_f \times N_s$, where the third dimension takes the first $N_s$ matrices in $F^*$, as illustrated in Fig. 2b. To distinguish it from Wang et al. (2020b), we refer to this as the Truncated SoftPool feature, denoted as $\tilde{F}$ in Fig. 2b.
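To make the assembly and truncation concrete, the following sketch builds the SoftPool feature from a toy feature map and compares the element counts before and after truncation. It is an illustration under assumptions: the function names are hypothetical, the tensor is stored region-first (a list of regions, each with $N_r$ rows and $N_f$ columns) rather than in the $N_r \times N_f \times N_s$ layout of the text, and the values $N_r = 32$ and $N_s = 8$ used in the size comparison are made up.

```python
def softpool(features, n_r):
    """For every element index k, keep the n_r rows with the highest k-th activation.

    Returns a list of N_f regions, each an n_r x N_f matrix.
    """
    n_f = len(features[0])
    return [
        sorted(features, key=lambda f: f[k], reverse=True)[:n_r]
        for k in range(n_f)
    ]

def truncate(softpool_feature, n_s):
    """Truncated SoftPool feature: keep only the first n_s regional matrices."""
    return softpool_feature[:n_s]

def num_elements(n_r, n_f, n_regions):
    """Number of scalars stored in an n_r x n_f x n_regions tensor."""
    return n_r * n_f * n_regions

F = [
    [0.20, 0.50, 0.30],
    [0.10, 0.60, 0.30],
    [0.70, 0.20, 0.10],
    [0.05, 0.05, 0.90],
    [0.40, 0.35, 0.25],
]
F_star = softpool(F, n_r=2)         # 3 regions, each 2 x 3
F_trunc = truncate(F_star, n_s=2)   # 2 regions, each 2 x 3

# Size comparison for N_f = 256 (the latent size from the text) with
# hypothetical N_r = 32 and N_s = 8.
full = num_elements(n_r=32, n_f=256, n_regions=256)      # 2,097,152 scalars
truncated = num_elements(n_r=32, n_f=256, n_regions=8)   # 65,536 scalars
```

Under these assumed sizes the truncation shrinks the feature by a factor of $N_f / N_s = 32$, which is the kind of saving that motivates the truncated feature.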

Regional Convolutions
Considering that each point in the cloud independently goes through the MLP while the operations thereafter to produce the truncated SoftPool features rely on sorting, the rows of our feature remain independent of one another. However, in contrast to max-pooling, which produces a vector, our feature is a 3D tensor which can undergo convolutional operations.
Instead of applying the same kernel to all regions as in Wang et al. (2020b), we generalize the regional convolutions and impose distinct kernels for each region. We first split $\tilde{F} = [\tilde{F}_r]_{r=1}^{N_s}$ into separate regions $\tilde{F}_r$ and correspondingly apply a set of kernels $W_{conv} = \{W_r\}_{r=1}^{N_s}$. Assigning the concatenated output tensor as $P_{out} = [P_r]_{r=1}^{N_s}$, we can formally describe this operation as
$$P_r = \tilde{F}_r * W_r \tag{1}$$
for the $r$-th region.
The dimension of each kernel is $N_k \times N_f \times N_{out}$, where $N_k$ indicates the number of neighbors to consider and $N_{out}$ is the desired size of the output $P_r$. Note that the kernels convolve over the entire width of $\tilde{F}_r$, i.e. its width $N_f$. This implies that we only pad on the vertical axis. Similar to other convolutional operators, the stride $s$ distinguishes between a convolutional and a deconvolutional operation. If the stride is greater than 1, $\tilde{F}_r$ is downsampled, while it is upsampled if the stride is less than 1.
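The following sketch illustrates a regional convolution with one kernel per region. It is a simplified illustration rather than the authors' implementation: the function name is hypothetical, the kernel height $N_k$ is assumed to be odd, zero padding is assumed on the row axis, and only integer strides of at least 1 (the downsampling case) are covered.

```python
def regional_conv(regions, kernels, stride=1):
    """Apply a distinct kernel to each region.

    regions: list of N_s matrices, each N_r x N_f
    kernels: list of N_s kernels, each N_k x N_f x N_out (N_k assumed odd)
    Returns a list of N_s matrices, each ceil(N_r / stride) x N_out.
    """
    outputs = []
    for region, kernel in zip(regions, kernels):
        n_r, n_f = len(region), len(region[0])
        n_k, n_out = len(kernel), len(kernel[0][0])
        pad = n_k // 2  # zero padding on the vertical (row) axis only
        zero_row = [0.0] * n_f
        padded = [zero_row] * pad + region + [zero_row] * pad
        out = []
        for start in range(0, n_r, stride):
            # The kernel spans the full width N_f, so it only slides over rows.
            row = [
                sum(
                    padded[start + a][b] * kernel[a][b][c]
                    for a in range(n_k)
                    for b in range(n_f)
                )
                for c in range(n_out)
            ]
            out.append(row)
        outputs.append(out)
    return outputs

# Two regions (N_s = 2) with N_r = 4 rows and N_f = 2 columns each.
regions = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
]
# Distinct kernels with N_k = 3, N_out = 1: only the centre row is non-zero,
# so region 0 is mapped to its row sums and region 1 to twice its row sums.
kernels = [
    [[[0.0], [0.0]], [[1.0], [1.0]], [[0.0], [0.0]]],
    [[[0.0], [0.0]], [[2.0], [2.0]], [[0.0], [0.0]]],
]
result = regional_conv(regions, kernels, stride=2)
```

With a stride of 2, each region of four rows is reduced to two rows, which mirrors the downsampling behaviour described for strides greater than 1.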

SoftPool++ Module
Now, we have all the components to build the feature extraction module as shown in Fig. 3, which we call SoftPool++. Since $P_{in}$ is defined as the input point cloud, we generalize the input of the module as $F_{in}$, where we set $F_{in} = P_{in}$ in the first layer. Hence, the input matrix $F_{in}$ goes through a 3-layer perceptron, then builds the truncated SoftPool features. Thereafter, we perform regional convolution and reshape the results by squeezing the third dimension to finally acquire our output feature matrix $F_{out}$. When constructing our architecture in Sect. 4, the encoder and decoder are distinguished primarily by the stride $s$. In this paper, we show the versatility of this novel module to act as an encoder and decoder as well as to refine a coarse point cloud with more elaborate details.
The differences between decoding from PointNet features and SoftPool++ features are evident in Fig. 4, where we replace the PointNet feature in MSN (Liu et al. 2020) with a SoftPool++ feature of the same size of 1024. By replacing the PointNet (Qi et al. 2017a) encoder in MSN (Liu et al. 2020) with our SoftPool++ encoder, we show that the SoftPool++ feature supplements the MSN decoder such that all the wheels are clearly separated from the body of the SUV, while the original PointNet feature in MSN follows the more generic structure of a vehicle with tiny gaps between wheel and body. This indicates that SoftPool++ makes our decoder able to take all observable geometries into account to complete the shape, while the max-pooled PointNet feature cannot deal with geometric structures which are rarely or not at all seen in the training data.

Network Architecture
The volumetric U-Net (Çiçek et al. 2016; Yang et al. 2018a) in 3D-RecGAN (Yang et al. 2017) and GRNet (Xie et al. 2020b) has shown significant improvements in object completion as it injects more data from the encoder to the decoder in order to supplement the compressed latent feature. Without the skip connection in U-Net, we end up losing most of the input data as it goes through the encoder. Consequently, the decoder starts hallucinating the overall structure without being faithful to the given information. Inspired by this idea, we introduce a novel U-Net connection that directly takes the point cloud as input, i.e. without the need for voxelization at any stage of the network.
Our network architecture is composed of an encoder-decoder structure with a skip connection as shown in Fig. 5. Such a connection between encoder and decoder makes the completion more likely to preserve input geometries. The encoder is composed of consecutive feature extraction modules from Sect. 3.4 to downsample the input to the latent feature, while the decoder is composed of similar feature extraction modules to upsample to the output. As discussed in Sect. 3.4, the stride $s$ is a significant parameter to distinguish the two types of layers. Table 1 lists the values of all the parameters for the module in the convolution and deconvolution layers.
The skip connection links the encoder output $F^{conv1}_{out}$ to the decoder output $F^{deconv1}_{out}$. However, instead of simply concatenating them, we introduce a square matrix $R$ that transforms the features from the encoder as $F^{conv1}_{out} R$. Note that the multiplication by the transform is on the right side because the points are arranged row-wise in $P_{in}$, which implies that the feature vectors are also arranged row-wise. Subsequently, we concatenate the two matrices into $[F^{deconv1}_{out}, F^{conv1}_{out} R]$, which serves as the input to the feature extraction module, producing $F^{deconv2}_{out}$. In order to avoid arbitrarily large values in the transformation and attain numerical stability during training, we regularize the transformation matrix to be orthonormal such that all elements are in $[-1, 1]$ and it mathematically satisfies $R R^\top = I$, where $I$ is an identity matrix. Geometrically, the regularizer constrains $R$ to rotate the features.
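The feature transform in the skip connection can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, a 2D rotation stands in for a learned orthonormal $R$, and the two feature matrices are concatenated along the point (row) axis, which is inferred from the point counts quoted for deconv2 rather than stated explicitly.

```python
import math

def matmul(a, b):
    """Plain matrix product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def skip_concat(f_deconv, f_conv, r):
    """Stack decoder rows with encoder rows transformed by right-multiplying R."""
    return f_deconv + matmul(f_conv, r)

# A rotation by 30 degrees is orthonormal, so it satisfies R R^T = I.
theta = math.pi / 6
R = [
    [math.cos(theta), -math.sin(theta)],
    [math.sin(theta), math.cos(theta)],
]

f_deconv = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # decoder features, 3 rows
f_conv = [[2.0, 0.0], [0.0, 3.0]]                # encoder features, 2 rows
merged = skip_concat(f_deconv, f_conv, R)        # 5 rows in total
```

Since $R$ is orthonormal, the transformed encoder rows keep their lengths (2 and 3 in this example); only their orientation in feature space changes.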

Minimum Density Sampling
Since the number of points in the input cloud varies, the results of deconv2 would also produce a varying number of points, i.e. $1024 + \frac{N_{in}}{4}$ points from Table 1, since it depends on the input dimension. Thus, we include a Minimum Density Sampling (MDS) (Liu et al. 2020) in the decoder to standardize the output to a coarse resolution of 2048 points. The coarse resolution is then refined with two deconvolutional operations to 16,384 points. The motivation for adding the MDS is to help the final deconvolutional layers converge faster during training. Later in Sect. 6, we further investigate the differences between the point clouds from the skip connection and the coarse and fine outputs illustrated in Fig. 5.

Loss Functions
Since the main goal here is point cloud completion (Groueix et al. 2018; Yang et al. 2018b; Yuan et al. 2018), we first analyse whether the predicted point feature $P_{out}$ matches the given ground truth $P_{gt}$ through the Chamfer distance
$$L_{complete} = \text{Chamfer}(P_{out}, P_{gt}) . \tag{2}$$
Furthermore, we optimize our architecture with two sets of loss functions: one related to the feature extraction module from Sect. 3.4, applied to all the convolution and deconvolution layers in the architecture, and one related to the skip connection with the feature transform from Sect. 4.
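For reference, a Chamfer distance between two point sets can be sketched as below. Several variants exist (L1 or L2 distances, sums or means), and the text does not fix one at this point, so this sketch assumes the symmetric form with squared Euclidean distances averaged over each set; the function name is hypothetical.

```python
def chamfer(p, q):
    """Symmetric Chamfer distance with squared Euclidean point distances."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def one_way(src, dst):
        # Average distance from every source point to its nearest neighbour.
        return sum(min(sq_dist(s, d) for d in dst) for s in src) / len(src)

    return one_way(p, q) + one_way(q, p)

p_out = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
p_gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
loss = chamfer(p_out, p_gt)
```

The brute-force nearest-neighbour search is quadratic in the number of points, which is acceptable for a toy example but would normally be replaced by a batched GPU implementation during training.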

Optimizing the Feature Extraction Module
For the feature extraction module that utilizes SoftPool features, we adopt the same loss terms as in Wang et al. (2020b), where their main objective is to optimize the distribution of the features across different regions.

Intra-regional Entropy
The ideal case for the feature vector $f_i$ is a one-hot code, i.e. each vector gets assigned to only one region. To accomplish this goal, we measure the probability of $f_i$ belonging to region $k$ among all $N_s$ regions by directly applying the softmax on the elements of the vector as
$$P(f_i, k) = \frac{e^{f_i[k]}}{\sum_{j=1}^{N_s} e^{f_i[j]}} . \tag{3}$$
This implies that $P$ is maximized when $f_i$ is a one-hot code, with the $k$-th element equal to one. However, in the presence of multiple peaks in the vector, $P(f_i, k)$ might decrease significantly. Therefore, by taking the entropy into account, the intra-regional loss function
$$L_{intra} = -\frac{1}{B} \sum_{i=1}^{N_{in}} \sum_{k=1}^{N_s} P(f_i, k) \log P(f_i, k) , \tag{4}$$
where $B$ is the batch size, tries to enforce the feature vector to have one peak so that it confidently falls into just one region.
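The effect of the entropy term can be checked numerically with a small sketch. It assumes the entropy is taken over the softmax distribution of each feature vector and averaged over the vectors; the normalization and the function names are assumptions made for illustration.

```python
import math

def softmax(v):
    """Numerically stable softmax of a list of floats."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def entropy(p):
    """Shannon entropy of a discrete distribution (natural logarithm)."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def intra_entropy(features):
    """Mean entropy of softmax(f_i) over all feature vectors."""
    return sum(entropy(softmax(f)) for f in features) / len(features)

one_hot = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
flat = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]

peaked_loss = intra_entropy(one_hot)
flat_loss = intra_entropy(flat)   # equals log(4), the maximum for 4 regions
```

Vectors with a single peak yield a lower entropy than flat vectors, so minimizing this term pushes each feature vector towards a single region.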

Inter-regional Entropy
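The drawback of $L_{intra}$ is that it can be minimized even when all feature vectors have the same peak at the $k$-th element. Taking a more holistic perspective, the inter-regional loss function aims at distributing the features across different regions. It relies on maximizing the regional entropy
$$E_r = -\frac{1}{B} \sum_{j=1}^{B} \sum_{k=1}^{N_s} \bar{P}_k \log \bar{P}_k \tag{5}$$
given that
$$\bar{P}_k = \frac{1}{N_{in}} \sum_{i=1}^{N_{in}} P(f_i, k) . \tag{6}$$
We can then define the loss function as
$$L_{inter} = \log(N_s) - E_r , \tag{7}$$
since the upper bound of $E_r$ is computed as $-\log \frac{1}{N_s}$ or simply $\log(N_s)$.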
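A small numerical sketch shows how such an inter-regional term behaves for a single sample. It assumes the regional entropy is computed on the mean softmax probability per region and that the loss is the gap to the upper bound $\log(N_s)$; the function names are hypothetical.

```python
import math

def softmax(v):
    """Numerically stable softmax of a list of floats."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def inter_loss(features):
    """Gap between log(N_s) and the entropy of the mean region probabilities."""
    n_s = len(features[0])
    probs = [softmax(f) for f in features]
    mean_p = [sum(p[k] for p in probs) / len(probs) for k in range(n_s)]
    e_r = -sum(x * math.log(x) for x in mean_p if x > 0.0)
    return math.log(n_s) - e_r

# Points spread evenly over the three regions.
balanced = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# All points peak in the same region.
collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

balanced_loss = inter_loss(balanced)    # approximately 0
collapsed_loss = inter_loss(collapsed)  # strictly positive
```

The loss vanishes when the points are spread evenly over the regions and grows when they collapse onto one region, which complements the intra-regional term.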

Boundary Overlap Minimization
In addition to optimizing the holistic distribution of the points, we also incorporate a loss function that is applied on pairs of regions $i$ and $j$. We collect a set of points $B_i^j$ from region $i$ with activations of region $j$ larger than a threshold $\tau$, which we set to 0.3. Similarly, we also take the inverse $B_j^i$. Consequently, we squeeze the overlaps between the two regions.
By minimizing the Chamfer distance between $B_i^j$ and $B_j^i$, we obtain the loss
$$L_{boundary} = \sum_{i=1}^{N_s} \sum_{j=i}^{N_s} \text{Chamfer}(B_i^j, B_j^i) \tag{8}$$
that tries to make the overlapping sets of points smaller, ideally down to just a line. In Fig. 6, we visualize the difference of optimizing with and without $L_{boundary}$, where the distribution of the point cloud is less noisy on the occluded regions such as the armrest.
Notably, this loss function is general enough to be effectively applied also to other methods that rely on a subdivision of the point cloud into different regions, such as AtlasNet (Groueix et al. 2018), PCN (Yuan et al. 2018) and MSN (Liu et al. 2020). In Sect. 7.2, we formally evaluate these methods with and without $L_{boundary}$.
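The construction of the two boundary sets for one pair of regions can be sketched as follows. The text does not specify how a point is assigned to a region, so this sketch assumes the region with the highest activation; it also assumes a squared-distance Chamfer variant and returns zero when either set is empty. All function names are hypothetical.

```python
def argmax(v):
    return max(range(len(v)), key=lambda k: v[k])

def boundary_set(points, activations, i, j, tau=0.3):
    """Points assigned to region i whose activation for region j exceeds tau."""
    return [
        p for p, f in zip(points, activations)
        if argmax(f) == i and f[j] > tau
    ]

def chamfer(p, q):
    """Symmetric Chamfer distance with squared Euclidean point distances."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def one_way(src, dst):
        return sum(min(sq_dist(s, d) for d in dst) for s in src) / len(src)

    return one_way(p, q) + one_way(q, p)

def boundary_loss(points, activations, i, j, tau=0.3):
    b_ij = boundary_set(points, activations, i, j, tau)
    b_ji = boundary_set(points, activations, j, i, tau)
    if not b_ij or not b_ji:
        return 0.0  # no overlap between the two regions
    return chamfer(b_ij, b_ji)

# Four points on a line with activations for two regions.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
activations = [[0.9, 0.1], [0.6, 0.4], [0.4, 0.6], [0.1, 0.9]]
loss = boundary_loss(points, activations, i=0, j=1)
```

In this example only the two middle points lie in the overlap, so the loss measures the gap between them; pulling the two boundary sets together narrows the overlapping band.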

Feature Duplicate Minimization
The last loss term imposes that the resulting truncated SoftPool feature $\tilde{F}$ takes most of the features from the original $F$ so that it avoids duplicates, which we express with the earth moving distance as
$$L_{preserve} = \text{EMD}(\tilde{F}, F) . \tag{9}$$
To make the earth moving distance (Li et al. 2013) more efficient, 256 vectors are randomly selected from $\tilde{F}$ and $F$. In practice, Fig. 7 visualizes the effects of $L_{preserve}$ in the reconstruction, where lower weights of this loss produce a large hole, while incorporating this loss builds a point cloud with similar densities.

Optimizing the Skip Connection
We first visualize a subset of the architecture and focus on the skip connection as shown in Fig. 8. Here, we define $P_{partial}$ as the partial reconstruction on $F^{deconv2}_{out}$ contributed by the skip connection with the feature transform. However, note that $P_{partial}$ is not a subset of $F^{deconv2}_{out}$. It is produced by taking the input point cloud through conv1, the feature transform and deconv2.
Since the skip connection aims to maintain the given input structure, we define a loss function that acts as an autoencoder such that
$$L_{skip} = \text{Chamfer}(P_{partial}, P_{in}) . \tag{10}$$
In addition, based on Sect. 4, we regularize the values in the feature transform such that
$$\| R R^\top - I \|^2 \tag{11}$$
is minimized.
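One common way to encourage an orthonormal transform is to penalize the deviation of $R R^\top$ from the identity. The sketch below assumes the squared Frobenius norm as the penalty, which is an illustrative choice and not necessarily the exact regularizer used by the authors; the function name is hypothetical.

```python
def orthonormal_penalty(r):
    """Squared Frobenius norm of R R^T - I for a square matrix R."""
    n = len(r)
    total = 0.0
    for a in range(n):
        for b in range(n):
            # Entry (a, b) of R R^T is the dot product of rows a and b.
            dot = sum(r[a][k] * r[b][k] for k in range(n))
            target = 1.0 if a == b else 0.0
            total += (dot - target) ** 2
    return total

identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]      # a permutation matrix is orthonormal
stretched = [[2.0, 0.0], [0.0, 1.0]]  # scales the first axis, not orthonormal

penalties = [orthonormal_penalty(m) for m in (identity, swap, stretched)]
```

The penalty is zero for the identity and the permutation matrix and positive for the stretched matrix, so minimizing it keeps the transform close to a rotation or reflection.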

Discriminative Training
Recognizing the advantages of TreeGAN (Shu et al. 2019), we also investigate applying discriminative training conditioned on the input partial scan $P_{in}$. In this case, we first introduce the conditional feature maps $P_{out}|P_{in}$ and $P_{gt}|P_{in}$ by concatenating them along the point axis. We build our discriminator $D$ with the same parametric model proposed in Shu et al. (2019). By restricting the output of the discriminator to a range between 0 and 1, we then apply the loss in (12) to optimize our completion architecture and the loss in (13) to optimize the discriminator $D$. In practice, we impose the loss functions in (12) and (13) alternately in order to optimize the completion architecture and the discriminator separately.

Experiments
For all evaluations, we train our model with an NVIDIA Titan V and parameterize it with a batch size of 8. Moreover, we apply the Leaky ReLU with a negative slope of 0.2 on the output of each regional convolution.

Completion on ShapeNet
We evaluate the performance of the geometric completion of a single object on the ShapeNet (Chang et al. 2015) database, which provides point clouds of partial scans as input and the corresponding completed shapes as ground truth.
To make it comparable to other approaches, we adopt the standard 8 category evaluation (Yuan et al. 2018).

Low Resolution
At low resolution, we achieve competitive results, matching the $0.13 \times 10^{-4}$ of PMP-Net (Wen et al. 2020b) with the L2-Chamfer distance in Table 2, while we achieve state-of-the-art results when evaluating on the L1-Chamfer distance in Table 3.

High Resolution
We achieve the best results on most objects with the high resolution as presented in Tables 4 and 5 with 8.31×10 −3 and 2.55×10 −3 , respectively.Table 5 also shows that volumetric approaches like 3D-EPN (Dai et al. 2017) and ForkNet (Wang et al. 2019b) having large issues when evaluated in Chamfer distance because the converted point clouds from the fixed volumetric grids are at much smaller local resolutions.
Validating with F-Score@1% Since the Chamfer distance hardly reflect the errors in the local geometry as suggested in Tatarchenko et al. (2019), the evaluation in GRNet (Xie et al. 2020b) uses the metric F-Score@1% that computes the F-Score after matching the predicted point cloud to the ground truth with a distance threshold of 1% of the side length of the reconstructed volume.The evaluations on reconstructing higher resolutions are reported in Tables 6 and 7 on ShapeNet objects provided by the Completion3D (Tchapmi et al. 2019) and MVP (Pan et al. 2021), respectively.Here, the average F-Score with SoftPool++ outperforms the other methods.The tables also validate the benefit of our individual contributions in the overall result.In addition, Table 7 shows that, by applying our SoftPool++ module on the variational coarse sub-architecture of VRCNet (Pan et al. 2021), the average performance of the fine reconstruction reached the state-ofthe-art with the improvement from 78.1 to 79.9%.
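The metric can be sketched as the harmonic mean of precision and recall under a distance threshold. This is a minimal brute-force illustration with hypothetical names; the threshold is passed in directly, so for F-Score@1% it would be set to 0.01 times the side length of the reconstructed volume.

```python
import math

def f_score(pred, gt, threshold):
    """F-Score between two point sets under a distance threshold."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def fraction_within(src, dst):
        # Share of source points whose nearest neighbour is closer than threshold.
        hits = sum(1 for s in src if min(dist(s, d) for d in dst) < threshold)
        return hits / len(src)

    precision = fraction_within(pred, gt)  # predicted points near the ground truth
    recall = fraction_within(gt, pred)     # ground-truth points near the prediction
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Unit-length volume, so the 1% threshold is 0.01.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.005, 0.0, 0.0)]
score = f_score(pred, gt, threshold=0.01)
```

In this example one of the two predicted points is far from the ground truth (precision 0.5) while both ground-truth points are covered (recall 1.0), which gives an F-Score of 2/3.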
Moreover, the results from high resolution reconstruction also validates our conclusion when evaluating against SoftPoolNet (Wang et al. 2020b).With or without the skip connections, our SoftPool++ performs better than (Wang et al. 2020b).
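The F-Score@1% described above can be sketched as follows. This is a minimal illustration under stated assumptions: precision is taken as the fraction of predicted points that lie within the threshold of some ground-truth point, recall as the fraction of ground-truth points within the threshold of some predicted point, and the threshold is a fraction of a user-supplied `side_length`. The function name `f_score` and its parameters are hypothetical and are not taken from the GRNet evaluation code.

```python
import math


def _nearest_dist(p, cloud):
    """Euclidean distance from point p to its nearest neighbour in cloud."""
    return min(math.dist(p, q) for q in cloud)


def f_score(pred, gt, side_length=1.0, ratio=0.01):
    """F-Score at a distance threshold of ratio * side_length."""
    threshold = ratio * side_length
    # Precision: share of predicted points that are close to the ground truth.
    precision = sum(_nearest_dist(p, gt) < threshold for p in pred) / len(pred)
    # Recall: share of ground-truth points that are close to the prediction.
    recall = sum(_nearest_dist(g, pred) < threshold for g in gt) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pred = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
    gt = [(0.0, 0.0, 0.0), (0.505, 0.0, 0.0), (1.0, 0.0, 0.2)]
    # Two of three points match in each direction, so the score is about 0.667.
    print(f_score(pred, gt, side_length=1.0))
```

Unlike the Chamfer distance, a single badly placed point cannot dominate this score, since each point contributes only a hit or a miss at the chosen threshold.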

Qualitative Evaluation
Similar to Sect. 6.1, the models in this section are also trained and evaluated on ShapeNet (Chang et al. 2015). However, for the qualitative results in Fig. 9, we show the results at the original point resolution specified by the respective methods.

Comparison against PointNet
As shown in Fig. 9, the max-pooling operation of the PointNet (Qi et al. 2017a) feature is embedded in FoldingNet (Yang et al. 2018b), PCN (Yuan et al. 2018) and MSN (Liu et al. 2020). We notice that these methods either over-smooth the reconstruction or start introducing noise.
On one hand, FoldingNet (Yang et al. 2018b) and PCN (Yuan et al. 2018) smooth out the reconstruction so that fine details such as the armrest of the chair are no longer visible and the wheels of the car are no longer separated. On the other hand, MSN (Liu et al. 2020) tries to reconstruct the finer details but produces a noisy point cloud. Contrary to these methods, we achieve a smoother surface reconstruction with visible geometric details of the object, like the armrest and the wheels.

Advantage of Skip Connections
We also explore the combination of 3D-GCN (Lin et al. 2020) and TreeGAN (Shu et al. 2019), which uses graph convolutions in an encoder-decoder architecture. Its latent feature is a vector of length 1024. Without the skip connection, several inconsistencies emerge. For instance, the shape of the boat is slimmer than the ground truth, while one dimension of the bookshelf is thicker. This information is part of the input but is not propagated to the output.
Among these methods, GRNet (Xie et al. 2020b) achieves quantitative results similar to ours in Table 5. It also builds skip connections between its encoder and decoder. However, it first voxelizes the input point cloud and, after going through the encoder-decoder, converts the 3D grid back to a point cloud. This discretization of the point cloud affects the results of GRNet (Xie et al. 2020b): it fails to reconstruct thin structures like the antenna on the boat and the vertical stabilizers of the jet. In addition, it tries to fill up the hole in the box, which should have remained empty. In contrast, our method, which operates directly on the point cloud, can handle these cases.
Improvements from SoftPoolNet (Wang et al. 2020b). Moreover, we compare the proposed method against the previous SoftPoolNet (Wang et al. 2020b) to reveal the advantages of our novel approach. In Fig. 9, while the previous method fails to reconstruct the four corners of the box and the wheels of the jet, the new method is more consistent with the ground truth. Overall, our novel approach reconstructs sharper geometries with less noise and fewer holes.
Other Methods. There has been a trend to re-purpose methods that were originally tailored for semantic segmentation, such as PointCNN (Li et al. 2018), for object completion. Since both tasks operate on point clouds, the intuition is to use the local convolutions from Li et al. (2018) to upsample the point cloud from its partial scan to its completed structure.
Unfortunately, these methods fail to reconstruct the objects because completion is not the intended purpose of the architecture: in semantic segmentation, the input and output point clouds remain the same.

Classification on ModelNet and PartNet
In addition to shape completion, we also evaluate our approach in terms of classification on the ModelNet10 (Zhirong et al. 2015), ModelNet40 (Zhirong et al. 2015) and PartNet (Mo et al. 2019) datasets. Note that ModelNet40 contains 12,311 CAD models classified into 40 categories, while PartNet contains 26,671 models in 24 categories. Similar to other approaches such as 3D-GAN (Wu et al. 2016), RS-DGCNN (Sauder et al. 2019), VConv-DAE (Sharma et al. 2016), FoldingNet (Yang et al. 2018b) and KCNet (Shen et al. 2018), we first run a self-supervised training to extract features from the input point cloud and then a supervised training of a linear Support Vector Machine (SVM) (Cortes and Vapnik 1995) to predict the category. The former relies on the 57,448 ShapeNet models (Chang et al. 2015) as its training dataset, while the latter relies on ModelNet (Zhirong et al. 2015) and PartNet (Mo et al. 2019).
Note that the details of our self-supervised training differ significantly from RS-DGCNN (Sauder et al. 2019). On one hand, our method randomly subsamples the point cloud while, on the other, Sauder et al. (2019) include an additional data augmentation step that randomly decomposes the 3D input structure into different parts and then repositions these parts by translation. Since we did not include the additional augmentation from Sauder et al. (2019), our evaluation is a fair comparison against other methods.
The evaluation in Table 8 reports that our model outperforms the accuracy of RS-DGCNN (Sauder et al. 2019) by 4.11% on the ModelNet40 dataset, a sign of its higher descriptiveness in terms of categorical information. The improvement of 2.47% from our approach compared to SoftPoolNet (Wang et al. 2020b) is also clear, indicating that the proposed SoftPool++ feature and skip connection together are more advantageous for classification. Similar results are also obtained on ModelNet10 (Zhirong et al. 2015) and PartNet (Mo et al. 2019).

Efficiency
In addition to the evaluation in terms of shape completion and categorical classification, we also compare in Table 9 the properties of our model, such as its memory footprint and inference speed, as well as the type of data being processed. The cost of outperforming SoftPoolNet (Wang et al. 2020b) becomes evident in the memory footprint and the inference time. Compared to SoftPoolNet (Wang et al. 2020b), the memory footprint of our method is approximately doubled due to the increase in the number of parameters from the multiple feature extraction modules in our architecture. This also leads to a longer inference time than SoftPoolNet (Wang et al. 2020b), increasing from 0.04 to 0.11 seconds. A similar trend holds with respect to other approaches that divide the point cloud into regions, such as AtlasNet (Groueix et al. 2018) and MSN (Liu et al. 2020), i.e. we achieve significantly higher accuracy in reconstruction but also increase the memory footprint and the inference time.
However, if we look at the overall data, we observe that the proposed method at 61.7MB consumes remarkably less memory than other point cloud approaches such as GRNet (Xie et al. 2020b) at 293MB and PointCNN (Li et al. 2018) at 497MB, as well as volumetric approaches such as 3D-EPN (Dai et al. 2017) at 420MB and ForkNet (Wang et al. 2019b) at 362MB. An important reason why their models are so large in memory usage is that 3D convolutions are applied in multiple layers of their architectures, while our approach is mainly composed of 2D convolutions only. Among the approaches with large memory consumption, GRNet (Xie et al. 2020b) is one of the top performers in point cloud completion. Since its architecture relies on volumetric grids, converting the input point cloud to a voxel grid and then back to a point cloud, this affects not only its memory footprint but also its inference time, which is 8 times higher than ours.
Compared to approaches composed mainly of MLPs, our model reports a size comparable to PCN (Yuan et al. 2018) while having a faster inference time than MSN (Liu et al. 2020). The reason is that, although our 2D convolution kernels introduce an additional dimension, the newly added dimension $N_k$ of 32 is much smaller than the feature dimension $N_f$ of 256 at which MLPs operate. Notably, approaches based on KNN search such as PointCNN (Li et al. 2018) and 3D-GCN (Lin et al. 2020) usually take much longer for inference.
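The relative sizes behind this comparison can be checked with a few lines of arithmetic. The sketch below only restates the figures quoted in the text (model sizes in MB and the two inference times in seconds); the dictionary and variable names are illustrative and carry no meaning beyond this example.

```python
# Model sizes in MB, as quoted in the text.
model_size_mb = {
    "SoftPool++": 61.7,
    "GRNet": 293.0,
    "PointCNN": 497.0,
    "3D-EPN": 420.0,
    "ForkNet": 362.0,
}

ours_mb = model_size_mb["SoftPool++"]

# How many times larger each compared model is than ours.
size_ratio = {
    name: size / ours_mb
    for name, size in model_size_mb.items()
    if name != "SoftPool++"
}

# Inference times in seconds, as quoted in the text.
softpoolnet_time_s = 0.04
ours_time_s = 0.11
slowdown_vs_softpoolnet = ours_time_s / softpoolnet_time_s

if __name__ == "__main__":
    for name, ratio in sorted(size_ratio.items(), key=lambda kv: kv[1]):
        print(f"{name}: {ratio:.2f}x the size of SoftPool++")
    print(f"Inference time vs. SoftPoolNet: {slowdown_vs_softpoolnet:.2f}x")
```

With these numbers, every compared model is more than four times larger than ours, while our inference time is a little under three times that of SoftPoolNet.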

Ablation Study
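Based on the evaluation on ShapeNet (Chang et al. 2015), we further analyze the behavior of our proposed method through an ablation study. In this section, we demonstrate the advantage of SoftPool++ over PointNet; expound on the claims of our loss function; investigate the value of the skip connection with feature transform in our architecture; and delve deeper into what happens in the SoftPool++ module.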

Replacing PointNet with SoftPool++
In addition to the earlier comparison, we replace the PointNet (Qi et al. 2017a) features in existing methods with our SoftPool++ features, while keeping their decoders unchanged. In this way, we have a one-to-one comparison of PointNet and SoftPool++ features. Since these works depend on PointNet features (having a dimensionality of 1024), we also build our SoftPool++ features with the same size. Remarkably, the use of SoftPool++ features improves performance in all tested methods, i.e. the performance of FoldingNet (Yang et al. 2018b), PCN (Yuan et al. 2018) and MSN (Liu et al. 2020) improves by $0.14 \times 10^{-3}$, $0.23 \times 10^{-3}$ and $0.22 \times 10^{-3}$, respectively.

Loss Functions
Tables 3 and 8 include an ablation study that investigates the effects of the individual loss functions from Sect. 5. For both experiments, we notice that all loss functions are critical to achieve state-of-the-art results. Note that Fig. 6 and the cabinet completion in Fig. 7 demonstrate the contributions of $L_{\text{boundary}}$ and $L_{\text{preserve}}$, respectively, in the reconstruction.
$L_{\text{boundary}}$ in other methods. An interesting idea is the capacity of $L_{\text{boundary}}$ to be integrated in other existing methods that join multiple deformed 2D patches together to form the final output. Since the patches in AtlasNet (Groueix et al. 2018), PCN (Yuan et al. 2018) and MSN (Liu et al. 2020) frequently overlap with nearby patches, we tried to integrate $L_{\text{boundary}}$ into such methods. Tables 3 and 4 evaluate this idea and show that it helps FoldingNet (Yang et al. 2018b), PCN (Yuan et al. 2018) and AtlasNet (Groueix et al. 2018) perform better, improving the Chamfer distance by at least $1 \times 10^{-4}$ at a resolution of 2048 and $1 \times 10^{-3}$ at a resolution of 16,384.

Skip Connection with Feature Transform
One of the key contributions of this paper is the introduction of skip connections with feature transforms on point clouds. Our ablation study in Tables 2 and 3 also includes the numerical advantage of having the skip connection in our architecture, which improves the Chamfer distance by $0.65 \times 10^{-4}$ in L1 and $1.27 \times 10^{-4}$ in L2.
In addition to the numerical advantage, we also interpret these values through some examples in Fig. 10, where we reconstruct lamps. Without the skip connection, the model recursively simplifies the given partial scan until it reaches the latent feature. Due to the oversimplification, the output then builds the closest generic shape of the lamp. In contrast, with the skip connection, the model preserves the input structure and incorporates the given partial scan into the final reconstruction. In effect, the result is closer to the ground truth.
We also perform an ablation study on the regularization $L_R$ of the feature transform $R$ in Tables 2, 3, 4 and 5. Compared to our complete framework, the result trained without the skip connection drops by $0.64 \times 10^{-4}$. However, when trained with the skip connection but without the regularization of $R$, the result drops by $2.14 \times 10^{-4}$, which is significantly larger. Training with the skip connection but without the regularization therefore performs worse than removing the skip connection altogether. This clearly shows the advantage of the regularization term on the feature transform.

Activations from the SoftPool++ Features
Given the input point cloud, we explore how SoftPool++ sorts the points in the first feature extraction module of the architecture. For this experiment, we visualize the points based on the value of the first column in $F$, which is the result of the MLP as shown in Fig. 2. Fig. 11 therefore highlights the activations associated with this feature. Noticeably, due to the MLP, the points can undergo much more than just a linear transformation of their absolute coordinates.
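The sorting that underlies this visualization can be sketched as follows, assuming that points are ranked by descending activation in a chosen column of F. The descending order, the function name `sort_points_by_feature` and the optional `keep` argument that truncates the ranking are assumptions made for illustration; the sketch takes the per-point features as given and does not model the MLP that produces them.

```python
def sort_points_by_feature(points, features, column=0, keep=None):
    """Rank points by descending activation in one feature column.

    points:   list of 3D coordinates, one per point.
    features: list of per-point feature vectors (the rows of F).
    column:   index of the feature column used for ranking.
    keep:     if given, only the `keep` highest-ranked points are returned.
    Returns the reordered points and the corresponding original indices.
    """
    order = sorted(
        range(len(points)),
        key=lambda i: features[i][column],
        reverse=True,
    )
    if keep is not None:
        order = order[:keep]
    return [points[i] for i in order], order


if __name__ == "__main__":
    points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    features = [[0.1, 0.9], [0.7, 0.2], [0.4, 0.4], [0.9, 0.0]]
    _, order = sort_points_by_feature(points, features, column=0)
    print(order)  # [3, 1, 2, 0]
    top_points, top_order = sort_points_by_feature(points, features, column=0, keep=2)
    print(top_order)  # [3, 1]
```

Coloring each point by its rank in `order` would give a plot in the spirit of Fig. 11, and passing `keep` mimics a truncation of the sorted list.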
Continuing our analysis, we examine how the truncation sizes $(N_s, N_r)$ and the output dimension $N_f$ influence the completion. Table 10 summarizes this evaluation on ShapeNet (Chang et al. 2015) as we vary these values in the second SoftPool++ module of our architecture. As described in Table 1, since our SoftPool++ feature is fixed at 256 rows, we set $N_r \times N_s = 256$. Note that the next ablation study focuses on changing the number of rows by independently setting $N_r$ and $N_s$. For ease of training and evaluation over all $(N_s, N_r)$ and $N_f$, we do not apply the discriminative training D for Table 10. The table indicates that the Chamfer distance saturates as soon as $N_s$ reaches 8, $N_r$ reaches 32 and $N_f$ reaches 256. Beyond these values, only small improvements of around $0.01 \times 10^{-3}$ are attained. Therefore, we select $N_s = 8$, $N_r = 32$ and $N_f = 256$ so that the model has fewer parameters to train, which consequently leads to a smaller memory footprint.
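To make the constraint that the product of N_r and N_s equals 256 concrete, the short sketch below enumerates the integer pairs that satisfy it. The helper name `truncation_pairs` is hypothetical, and the subset of pairs actually evaluated in Table 10 is not restated here.

```python
def truncation_pairs(total_rows=256):
    """List all integer pairs (n_s, n_r) with n_s * n_r == total_rows."""
    return [
        (n_s, total_rows // n_s)
        for n_s in range(1, total_rows + 1)
        if total_rows % n_s == 0
    ]


if __name__ == "__main__":
    pairs = truncation_pairs(256)
    print(pairs)
    # The selected setting N_s = 8, N_r = 32 is one of the admissible pairs.
    print((8, 32) in pairs)  # True
```

Because 256 is a power of two, every admissible pair consists of powers of two, so the search over truncation sizes under this constraint is small.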
The next ablation study lifts the constraint of a fixed latent feature dimension, which we imposed by setting $N_r \times N_s = 256$ in Table 10. In Table 11, we consider different values of $N_r$ and $N_s$ while setting $N_f$ to 256, and observe that the error plateaus when $N_s$ is 8 and $N_r$ is 32. Note that these values match the optimum values from Table 10 and validate the advantage of truncation.
Considering the numerical advantages of $N_s$, we also explore it visually while keeping $N_f$ and $N_r$ fixed at 256 and 32, respectively. Similar to Fig. 11, Fig. 12 plots the points from the input point cloud that are truncated by $N_s$. By increasing $N_s$ from 4 to 16, the resulting feature also covers more of the structure of the plane. For instance, the wings become more and more visible in the figure. This then raises the question of how much information from the partial scan the network needs for reconstruction, which relates back to Table 10, where we found the optimum value of $N_s$, i.e. 8. The comparison in Fig. 12 shows that larger values of $N_s$ do not add further points on the body of the plane, which is a part common to the plane category.

Conclusion
We propose a novel feature extraction technique called SoftPool++ that directly processes the point cloud. Compared to the counterpart that heavily relies on the max-pooling operation in PointNet (Qi et al. 2017a), our feature extraction method captures more information from the input point cloud by alleviating the limitation of taking only the maximum, while also establishing the relation between different points through our regional convolutions.
By structuring multiple SoftPool++ modules in an encoder-decoder architecture, this paper is the first to propose a point-wise skip connection with feature transformation. Considering that the given point cloud is continuously downsampled in the encoder, the main advantage of such a connection is the capacity to incorporate the input data into the decoder. This overcomes the loss of information in the encoder.
Examining our contributions on 3D object completion, we found that we achieve state-of-the-art performance, especially on high-resolution reconstructions. We also visually demonstrate our advantage and conclude that our reconstructions are sharper, i.e. with less noise in the point cloud, and capture finer details, i.e. without over-smoothing different parts of the object.
Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Fig. 2 Toy examples of the truncated SoftPool feature. Given 5 points in (a), they go through a multi-layer perceptron (MLP) to produce $F$. At the k-th element, the vectors are sorted to build $F_k$ and consequently F.

Fig. 3 Overview of the feature extraction module called SoftPool++

Fig. 4 Object completion results with MSN (Liu et al. 2020) while using PointNet (Qi et al. 2017a) features and SoftPool++ features on its encoder

Fig. 5 Overview of our object completion architecture where the parameters for the convolution and deconvolution operations based on the feature extraction module are listed in Table 1. Note that, in our eval-

Fig. 6 Our object completion results with and without the influence of $L_{\text{boundary}}$

Fig. 7 Our object completion results while increasing the weight of $L_{\text{preserve}}$

for a single object completion. Both sampled from ShapeNet meshes, PCN (Yuan et al. 2018) and TopNet (Tchapmi et al. 2019) provide two datasets for low- and high-resolution evaluation, which contain 2048 and 16,384 points, respectively, where the inputs are provided with 2048 points. Note that the low-resolution dataset provided by TopNet is also commonly referred to as the Completion3D benchmark. Since previous works report their results in terms of the L1 and L2 metrics of the Chamfer distance separately, we also report our results at both resolutions (2048 and 16,384) and in both metrics (L1 and L2). We compare against state-of-the-art point cloud completion approaches such as PCN (Yuan et al. 2018), FoldingNet (Yang et al. 2018b), AtlasNet (Groueix et al. 2018), PointNet++ (Qi et al. 2017b), MSN (Liu et al. 2020) and GRNet (Xie et al. 2020b). To show the advantages over volumetric completion, we also compare against 3D-EPN (Dai et al. 2017) and ForkNet (Wang et al. 2019b) with an output resolution of 64 × 64 × 64. As for point cloud resolutions, PCN (Yuan et al. 2018), GRNet (Xie et al. 2020b) and SoftPoolNet (Wang et al. 2020b) report their best performance with 16,384 points, while MSN (Liu et al. 2020) presents its final output at a resolution of 8192 points. Aiming at a fair numerical comparison at different resolutions, we modify the last layers of these architectures so as to attain the same resolution for all methods.

Fig. 10 Object completion results with and without the influence of the skip connection

Fig. 11 Visualization of the first row of $F$ in the first SoftPool++ module of our architecture

Fig. 12 Visualization of the truncated points on the input point cloud for different values of $N_s$

Table 1 Dimensions and parameters of each feature extraction module in our architecture. Note that the input to the architecture is the point cloud $P_{in}$ with a dimension of $N_{in} \times 3$, while the output is another point cloud $P_{out}$ with a dimension of 16,384 × 3

Table 2 Evaluation of object completion based on the Chamfer distance, trained with the L2 distance (multiplied by $10^4$), at an output resolution of 2048

Table 3 Evaluation of object completion based on the Chamfer distance, trained with the L1 distance (multiplied by $10^4$), at an output resolution of 2048

Table 4 Evaluation of object completion based on the Chamfer distance, trained with the L1 distance (multiplied by $10^3$), at an output resolution of 16,384

Table 5 Evaluation of object completion based on the Chamfer distance, trained with the L2 distance (multiplied by $10^3$), at an output resolution of 16,384

Table 6 Evaluation of object completion based on the F-Score@1%, trained with the L2 Chamfer distance, at an output resolution of 16,384

Table 7 Evaluation of object completion based on the F-Score@1%, trained with the L2 Chamfer distance, at an output resolution of 16,384

Table 9 Overview of different object completion methods
our loss function; investigate the value of the skip connection with feature transform in our architecture; and; delve deeper on what happens in the SoftPool++ module.

Table 10 Influence of $N_f$ and $(N_s, N_r)$ on the L2 Chamfer distance (multiplied by $10^3$), evaluated at an output resolution of 2048 points. Bold indicates the best performance achieved in each column

Table 11 Influence of $N_s$ and $N_r$ on the L2 Chamfer distance (multiplied by $10^3$), evaluated at an output resolution of 2048 points. $N_f$ is set to 256. Bold indicates the best performance achieved in each column