Hyperplane patch mixing-and-folding decoder and weighted chamfer distance loss for 3D point set reconstruction

3D point set reconstruction is an important and challenging 3D shape analysis task. Current state-of-the-art algorithms for 3D point set reconstruction employ a deep neural network (DNN) having an encoder–decoder architecture. Recently, decoder DNNs that transform multiple 2D planar patches to reconstruct a 3D shape have seen some success. These "patch-folding" decoders are adept at approximating smooth surfaces of 3D objects. However, 3D point sets generated by these decoders often lack local geometrical details, as 2D planar patches tend to overly constrain the patch-folding process. In this paper, we propose a novel decoder DNN for 3D point sets called Hyperplane Mixing and Folding Net (HMF-Net). HMF-Net uses less constrained hyperplane patches, not 2D plane patches, as its input to the folding process. HMF-Net has, as its core building block, a stack of token-mixing layers to effectively learn global consistency among the hyperplane patches. In addition to HMF-Net, we also propose a novel loss for 3D point set reconstruction called the Weighted Chamfer Distance (WCD). WCD weights, or amplifies, the loss from parts of a shape that are highly variable across training samples by emphasizing higher point-pair distance values between a generated point set and a groundtruth point set. This helps the decoder DNN learn shape details better. We comprehensively evaluate our algorithm under three 3D point set reconstruction scenarios, namely, shape completion, shape upsampling, and shape reconstruction from 2D images. Experimental results demonstrate that our algorithm yields higher accuracy than the existing algorithms for 3D point set reconstruction.


Introduction
Technology for 3D point set data analysis has become increasingly important, due in part to the growing popularity of 3D range scanners and the development of autonomous vehicles and robots. In recent years, deep learning techniques for 3D point set data have succeeded in a wide range of application scenarios including classification, segmentation, retrieval, pose estimation, registration, and reconstruction [1]. It can be said that 3D point set reconstruction is one of the most important and challenging problems in 3D shape analysis. The technique for 3D point set reconstruction is usable in diverse shape analysis applications, and various kinds of encoder DNN architectures for 3D point set data have been proposed. In contrast, however, decoder DNNs and loss functions for 3D point set reconstruction have not been fully explored in the literature.
Our goal in this paper is to realize 3D point set reconstruction that is accurate and applicable to diverse shape analysis tasks including shape completion, shape upsampling, and shape reconstruction from 2D images. To achieve this goal, we improve the patch-folding decoder DNN architecture [2, 11-13] and the Chamfer Distance (CD) loss [9], both of which are widely used in previous studies on deep learning-based 3D point set reconstruction.
To motivate our approaches, we first review the patch-folding decoder and the CD loss. The patch-folding decoder reconstructs a 3D point set by folding multiple planar "patches". Each patch is represented as a set of 2D points uniformly scattered on a 2D plane. Multi-Layer Perceptrons (MLPs), whose parameters are not shared with each other, are used to fold their corresponding patches. The patch-folding decoder is good at reconstructing smooth and continuous surfaces of 3D objects since the coordinates of the output 3D points are constrained by the input 2D points on the patches. However, we argue that the existing patch-folding decoders have two drawbacks. First, the constraints imposed by 2D planar patches are often too restrictive, impeding accurate reconstruction of local 3D geometry. Figure 1a exemplifies the 3D point set reconstructed by AtlasNet [12], which is one of the representative patch-folding decoders. In Fig. 1a, the leg part of the reconstructed 3D points approximates smooth surface(s), but its local 3D geometry is inaccurate. Second, the MLPs, which independently fold multiple patches, ignore the global relationship, or consistency, among the patches. This could lead to cracks on object surfaces or overlaps between the patches.
The CD is a kind of set-to-set distance. It is calculated as the sum of the positional displacements (e.g., Euclidean distances) of the corresponding 3D point pairs formed between the reconstructed 3D point set and the groundtruth 3D point set. The CD is widely adopted as a loss function to train DNNs for 3D point set reconstruction since it is computationally efficient compared to the other set-to-set distances (e.g., the Earth Mover's Distance). However, DNNs trained with the CD loss tend to generate "ambiguous" 3D shapes [9, 14]. Figure 1b exemplifies a reconstruction result having an ambiguous 3D point distribution. In Fig. 1b, the seat and backrest, which are parts geometrically stable across many chair samples in the training dataset, are successfully reconstructed with small positional displacements. On the other hand, the 3D points in the leg part, whose shape varies widely from one chair sample to another, have larger positional displacements, which should provide a strong impetus for the decoder to learn the geometry.
However, notice also that the geometrically stable (i.e., successfully reconstructed) parts have a higher point density than the geometrically variable (i.e., unsuccessfully reconstructed) parts. Consequently, the overall CD loss is dominated by the displacements of the successfully reconstructed parts having a higher number of points, while the displacements computed in the unsuccessfully reconstructed parts having a smaller number of points are nearly ignored. Training with such a loss is not effective for the decoder. The decoder DNN after training thus generates 3D point sets having ambiguous or "blurred" local geometry.

Fig. 1: a The existing patch-folding decoders often fail to reconstruct detailed local 3D geometry and tend to produce "blurred" or smoothly distributed 3D points. We suspect that such "blurred" reconstruction stems from using low-dimensional (2D) input patches. b The Chamfer Distance loss is dominated by displacements of 3D points in successfully reconstructed parts (circled in green), which are geometrically stable across data samples. Local geometry having high variability across data samples, which tends to have large displacements, is almost ignored in the loss.
To alleviate the abovementioned issues of the patch-folding decoder and the CD loss, we propose a novel decoder DNN architecture and a novel loss function for 3D point set reconstruction. Figure 2 illustrates our proposed algorithm. Our decoder DNN, called Hyperplane Mixing and Folding Net (HMF-Net), employs multiple hyperplane patches as its input representation. Hyperplane patches relax the constraints of the folding process since their dimensions are higher than those of the 2D plane patches used by the existing patch-folding decoders. In addition, to learn the consistent relationship among the hyperplane patches, HMF-Net processes the patches by using novel DNN layers, called Mixing and Folding (MF) blocks. The architecture of the MF block is built upon the token-mixing DNN [15].

The strength of the HMF-Net decoder and the WCD loss is their versatility. That is, the HMF-Net decoder can be coupled with encoder DNNs that process various data types (e.g., 3D point sets or 2D images). HMF-Net is thus applicable to a wide range of 3D point set reconstruction tasks. The WCD loss can also be used to train any encoder-decoder DNN for 3D point set reconstruction since the WCD loss can be a drop-in replacement for the CD loss.
Experimental evaluation demonstrates the effectiveness of the proposed methods in three 3D point set reconstruction tasks, namely, shape completion, shape upsampling, and shape reconstruction from multiview 2D images. In all three tasks, HMF-Net, trained by using the traditional CD loss, yields shape reconstruction accuracy higher than that of the existing decoder DNNs we compare against. We also show that the WCD loss has a positive impact on improving the shape reconstruction accuracy of both our HMF-Net and the decoder DNNs found in previous work.
The contribution of this paper can be summarized as follows.
• Proposing the novel HMF-Net decoder and the novel WCD loss for 3D point set reconstruction.They work synergistically in improving the quality of reconstructed 3D point sets.
• Comprehensively evaluating the proposed algorithms in three 3D point set reconstruction tasks. The efficacies of both HMF-Net and WCD are verified in all three tasks.
The rest of the paper is organized as follows. Section 2 reviews related work and Sect. 3 elaborates on our proposed algorithms. Section 4 reports experimental results. Finally, conclusions and future work are discussed in Sect. 5.

Deep learning for 3D point set analysis
A 3D point set data sample is an unordered set of points, each represented by its 3D coordinates. The points do not lie on rectilinear 3D grids, and there is no connectivity among them. Typically, the 3D points are distributed on the surface of a 3D object at irregular intervals to represent the shape of the 3D object. Therefore, standard convolution operations for grid-structured data, e.g., 2D images and 3D voxels sampled on rectilinear grid points (typically having equal intervals), cannot be applied directly to 3D point sets.
PointNet [16] is the first end-to-end DNN for 3D point set data. To handle the irregularity of 3D point distribution, PointNet extracts per-point features by using a parameter-shared MLP. To obtain permutation invariance over the 3D points, the set of per-point features is aggregated into a per-shape, global feature by using an orderless pooling operation. Following the success of PointNet, many studies proposed DNNs for a wide range of 3D point set analysis tasks including classification/segmentation [17, 18], retrieval [19], registration [20], pose estimation [21], and reconstruction [9, 14].
As summarized in [1], the studies that followed PointNet have made progress. Improvements in encoder DNN architectures are especially significant as they are essential for any 3D point set analysis task. To convolve a 3D point set, Li et al. [22] and Komarichev et al. [23] order 3D points in a local region and apply 1D convolution to the ordered 3D points. Tatarchenko et al. [24], Le et al. [25], and Su et al. [26] spatially quantize irregularly distributed 3D points into regular grids or lattice structures. The quantized sets of 3D points are encoded into local 3D shape features by using convolution filters having the same number of dimensions as the grid or lattice. Zhang et al. [27] and Wang et al. [18] construct a neighborhood graph structure by connecting neighboring 3D points and extract local 3D shape features by using graph convolution [28]. Recently proposed transformer-based 3D point set encoder DNNs [29, 30] extract 3D shape features adaptively to the shape context of the input 3D point set by using the self-attention mechanism [31].
Our proposed HMF-Net decoder can potentially be combined with any of the encoders described above to form an encoder-decoder DNN for 3D shape reconstruction.

Deep learning-based 3D point set reconstruction
The DNNs for 3D point set reconstruction employ an encoder-decoder architecture. When both the input and output of the DNN are 3D point sets, such a DNN is called an autoencoder. The encoder-decoder DNN comprises a pair of an encoder DNN and a decoder DNN. The encoder DNN maps the input data sample to its latent feature, and the decoder DNN takes as its input the latent feature and tries to reconstruct the original 3D point set. The encoder-decoder DNN is trained by using a loss function that measures the discrepancy between the reconstructed 3D shape and the groundtruth 3D shape. As we mentioned in Sect. 2.1, various powerful encoder DNNs have been proposed. In contrast, however, decoder DNNs and loss functions have not been fully studied.
Recently proposed state-of-the-art DNNs for 3D point set reconstruction employ a task-specific DNN architecture and/or training framework. For the task of shape completion, skip connections between the encoder and the decoder are widely adopted to improve reconstruction accuracy [4, 32, 52-54]. Tang et al. [55] propose to detect keypoints and a surface skeleton from an input incomplete 3D point set to realize topology-aware shape completion. For the task of shape upsampling, a majority of the methods (e.g., [5-7, 56, 57]) decompose an input sparse 3D point set into multiple local point sets and upsample each local point set independently by using a DNN. For the task of shape reconstruction from 2D image(s), several state-of-the-art methods (e.g., [58, 59]) employ a DNN architecture and a training framework that effectively associate latent visual features with 3D geometric features.
Since the abovementioned state-of-the-art DNNs for 3D point set reconstruction are highly tailored to their corresponding shape reconstruction tasks, they lack versatility. For example, the DNNs designed for the shape completion task [4, 32, 52-55] are difficult to apply directly to the tasks of shape upsampling and shape reconstruction from 2D images. Compared to the previous studies that aim at a specific shape reconstruction task, this paper focuses on decoder DNNs and loss functions that are applicable to a wide range of 3D shape reconstruction tasks.

Full-connection decoder
The decoder proposed by Achlioptas et al. [14] employs a stack of fully connected, or MLP, layers. Tchapmi et al. [3] employ multiple MLPs arranged in a tree structure to form their decoder DNN. The architectures of these DNNs are simple and their training is stable. However, these decoder DNNs are not adept at reconstructing smooth object surfaces or detailed local 3D geometry since their MLP-based architectures do not have inductive biases for 3D shape reconstruction.

Deconvolution decoder
The decoders designed by Li et al. [33], Yang et al. [34], and Hui et al. [35] adopt layers having an inductive bias for 3D shape reconstruction. That is, [33, 34] employ 2D deconvolution layers to transform adjacent pixels on a feature map into neighboring 3D points in the 3D space. [35] employs graph deconvolution to constrain neighboring points in the decoder's latent feature space so that they also become neighbors in the output 3D point set. These decoder DNNs are better suited for reconstructing local 3D geometry. However, they often face training instabilities due to the complex architecture of the decoder.

Patch-folding decoder
Yang et al. [11], Yuan et al. [2], Groueix et al. [12], and Deprelle et al. [13] proposed decoder DNNs that have an explicit inductive bias for reconstructing object surfaces. In these decoders, the latent feature vector extracted by the encoder DNN is duplicated and then concatenated with 2D points sampled on 2D planar patches. [11] uses a single patch while [2, 12, 13] use multiple (e.g., 64) patches. The decoder DNN consists of a set of MLPs whose parameters are not shared with each other. Each MLP transforms, or folds, the set of feature points belonging to the same patch into a set of 3D points. The patch-folding decoders are good at approximating 2-manifolds, or smooth surfaces, of 3D objects since the input 2D points on the patch act as constraints on the coordinates of the output 3D points. However, as shown in Fig. 1a, the smooth 3D point distribution is not suitable for representing partial 3D shapes having detailed and complex structures. We suspect that the strong constraint imposed by the low-dimensional planar patches interferes with generating detailed local 3D shapes. In each patch, differences in the coordinates of generated 3D points stem from differences in the input 2D coordinates on the patch. It is difficult for the decoder to generate complex local 3D geometry from differences in a mere two coordinate values. We also suspect that the parameter-unshared MLPs, which fold patches independently, ignore global interrelationships between the patches, lowering the quality of reconstructed point sets.

In this paper, we relax the constraints imposed on the folding process by increasing the dimensionality of the planar patches. Also, we leverage the token-mixing DNN [15] to learn the global consistency between the patches.

Loss function for 3D point set reconstruction
Either the Chamfer Distance (CD) or the Earth Mover's Distance (EMD) is typically used as the loss function to train a DNN for 3D point set reconstruction. In both CD and EMD, the overall loss is computed as the sum of positional displacements (e.g., Euclidean distances) between corresponding 3D point pairs. The corresponding 3D point pairs are formed between the groundtruth 3D point set and the 3D point set produced by the decoder. In CD, each point pair is formed between a 3D point within one point set and its closest 3D point within the other point set. In EMD, the point pairs are formed by computing a bijective mapping between the two 3D point sets. Many studies on 3D point set reconstruction prefer CD since it is more computationally efficient than EMD.
However, as we discussed in Sect. 1, the decoder trained with the CD loss tends to generate 3D point sets having "blurred out" or ambiguous local geometrical details (e.g., the leg of the chair shown in Fig. 1b). We observe a high concentration of reconstructed points in the successfully reconstructed regions, whose shape is stable across training samples. These points, by their sheer number, dominate the overall CD loss. Such a CD loss does not facilitate the learning of geometrically variable local geometry. As a result, the DNN training converges to a local minimum that generates ambiguous local geometrical shapes. In this paper, we change the CD loss to emphasize larger displacement values in corresponding 3D point pairs. This change should expose the parts having unsuccessful reconstruction and help the DNN find better local minima.

Token mixing DNN architecture
Recently, Transformers [31, 36] and token-mixers [15, 37-39] have become popular DNN architectures in the fields of natural language processing and 2D image analysis. Both architectures are capable of learning relationships between elements in a set, or tokens. In the field of 2D image recognition, Transformer and token-mixer architectures show accuracy comparable to the widely used Convolutional Neural Network (CNN), given a large amount of training samples.
MLP-Mixer [15] is one of the first token-mixing DNNs designed for 2D image classification. Compared to CNN and Transformer, the architecture of MLP-Mixer is simple since it has only MLPs as its building blocks. Despite its simplicity, MLP-Mixer achieves 2D image classification accuracy comparable to CNN and Transformer. MLP-Mixer consists of two types of MLPs, namely, the token-mixing MLP and the channel-mixing MLP. The token-mixing MLP mixes a set of image patch features, or tokens, by using weights indicating the relevance between the tokens. The channel-mixing MLP transforms the mixed features generated by the token-mixing MLP to refine them. The other token-mixers [37-39] also adopt a DNN architecture similar to MLP-Mixer.
In the field of 3D point set analysis, token-mixing DNNs have not yet been fully explored. Very recently, an MLP-Mixer-based DNN for 3D point set analysis was proposed by Choe et al. [40]. They adopt MLP-Mixer as an encoder DNN to learn the relationship among the 3D points of an input 3D point set. Our study differs from [40] in that we employ MLP-Mixer in the decoder DNN to learn the relationship between point set patches for accurate 3D point set reconstruction.

Overview of the proposed algorithm
For accurate 3D point set reconstruction, we propose a novel decoder DNN called Hyperplane Mixing and Folding Net (HMF-Net) and a novel loss function called Weighted Chamfer Distance (WCD). Figure 2 illustrates the overview of the proposed framework in the case of the point set completion task. Note that our algorithm is applicable to diverse 3D point set reconstruction tasks including shape completion, shape upsampling, and shape reconstruction from 2D image(s).
Our entire DNN has the encoder-decoder architecture. The encoder DNN maps the input data sample, e.g., an incomplete 3D point set, to its latent feature representation. The HMF-Net decoder takes as its input the latent feature and generates the output 3D shape, e.g., a completed 3D point set. HMF-Net leverages multiple high-dimensional planar patches, i.e., hyperplane patches, for accurate 3D shape reconstruction. High-dimensional (e.g., 5D) points sampled on a hyperplane are concatenated with the latent feature to form a set of pointwise features per hyperplane.
The sets of pointwise features are transformed into the sets of pointwise 3D coordinates by using an MLP-based DNN. Our MLP-based DNN includes the Mixing and Folding (MF) blocks, which are built upon the token-mixing DNN [15]. The MF blocks allow the multiple hyperplanes to communicate with each other and to collaboratively generate a globally consistent output 3D shape.
The entire DNN is trained in an end-to-end manner by using our WCD loss. WCD weights the distances among corresponding 3D point pairs formed between the output 3D point set and the groundtruth 3D point set. In WCD, small distances are deemphasized, or down-weighted, and large distances are emphasized, or up-weighted, so that the DNN can focus on the point pairs that have large reconstruction errors.

Hyperplane patches
We introduce hyperplane patches to relax the stronger-than-necessary constraints the 2D planar patches impose on the patch-folding process. A hyperplane is a linear subspace whose number of dimensions is one less than that of its ambient space. If the number of dimensions of the ambient space is N_a, the points sampled on a hyperplane in the N_a-dimensional ambient space lie in an (N_a − 1)-dimensional linear subspace. The value N_a controls the strength of the constraints imposed on the patch-folding process. Our experiments show that N_a around 5 to 8 improves the accuracy of 3D point set reconstruction. We use multiple (N_h) patches to reconstruct a 3D point set. N_h is fixed at 64 throughout this paper.
A hyperplane patch is created by the following procedure. We first randomly generate N_s points in the N_a-dimensional ambient space. For each axis, coordinate values of the N_s points are randomly chosen from the uniform distribution U(−1, 1). We then randomly choose an (N_a − 1)-dimensional hyperplane that passes through the origin. The N_s points are projected onto the hyperplane to form the set of points on the hyperplane patch. We iterate the above procedure N_h times to create N_h hyperplane patches, using a distinct hyperplane each time. The procedure for creating hyperplane patches is summarized in Algorithm 1.
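The patch-creation procedure above can be sketched in NumPy as follows; function and variable names are ours, and the paper's Algorithm 1 remains the authoritative version. A hyperplane through the origin is parameterized by its unit normal, and projection removes each point's component along that normal.

```python
import numpy as np

def make_hyperplane_patches(n_h=64, n_s=36, n_a=5, rng=None):
    """Create n_h patches of n_s points, each lying on a random
    (n_a - 1)-dimensional hyperplane through the origin.

    Returns an array of shape (n_h, n_s, n_a)."""
    rng = np.random.default_rng() if rng is None else rng
    patches = []
    for _ in range(n_h):
        # Points scattered uniformly in the n_a-dimensional ambient cube.
        pts = rng.uniform(-1.0, 1.0, size=(n_s, n_a))
        # A random hyperplane through the origin, defined by its unit normal.
        normal = rng.normal(size=n_a)
        normal /= np.linalg.norm(normal)
        # Project each point onto the hyperplane: p - (p . n) n.
        pts = pts - np.outer(pts @ normal, normal)
        patches.append(pts)
    return np.stack(patches)
```

Because each patch is projected onto a distinct hyperplane, the points of a patch span an (N_a − 1)-dimensional subspace of the ambient space.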
The number of points per patch N_s depends on the number of hyperplane patches N_h and the desired number of output 3D points per shape N_p. Similar to [11, 12], N_s is computed by (1):

N_s = ⌈√(N_p / N_h)⌉².  (1)
For example, when we use N_h = 64 patches and want to generate N_p = 2048 3D points per shape, N_s becomes 36. Therefore, our decoder produces N_h × N_s = 2304 3D points per shape. The number of produced 3D points is reduced to match the desired number by subsampling, whose details are given in the next subsection.

Patch deformation using mixing and folding blocks
The N_a-dimensional coordinate values of a point sampled on a hyperplane are concatenated with the latent 3D shape feature extracted by the encoder DNN to form the pointwise feature to be processed by the decoder. In this paper, we fix the number of dimensions N_d of the latent feature space at 512. Therefore, each pointwise feature after the concatenation has (512 + N_a) dimensions. Pointwise features are grouped by the hyperplane patches they sit on to form N_h sets, each set containing N_s elements. The N_h sets of pointwise features, each having (512 + N_a) dimensions, are regressed to the N_h sets of pointwise 3D coordinates by our MLP-based DNN. Our MLP-based DNN comprises a series of multiple (e.g., two) Mixing and Folding (MF) blocks sandwiched between two Folding-only blocks.

Folding-only block
The Folding-only block is identical to the folding layer employed by the existing patch-folding decoders [12, 13]. That is, each set of pointwise features is transformed by a fully connected (FC) layer. As in [12, 13], we use N_h different FC layers to independently fold the N_h patches. In the first Folding-only block, each FC layer has 512 output neurons activated by the ReLU function [41]. In the second Folding-only block, which is located at the output end of HMF-Net, each FC layer has three output neurons that represent the x-, y-, and z-coordinates. The 3D coordinates are activated by the hyperbolic tangent function to approximately fit the output 3D point set into a unit sphere.
Mixing and folding (MF) block
Figure 3 illustrates the processing in an MF block. Each MF block takes as its input the N_h sets of pointwise features produced either by the first Folding-only block or by the preceding MF block. Each set of pointwise features is first aggregated by max-pooling to create a 512D patchwise feature. The N_h patchwise features are treated as "tokens", and they are mixed with each other by using MLP-Mixer [15]. Specifically, we form the feature matrix X ∈ R^(N_h×512) by stacking the N_h patchwise features. The matrix X is processed by the token-mixing MLP followed by the channel-mixing MLP. The token-mixing MLP in this paper is formulated by (2):

U = X + W_2^T σ(W_1^T X),  (2)

where W_1 ∈ R^(N_h×4N_h) and W_2 ∈ R^(4N_h×N_h) are the learnable parameter matrices of the token-mixing MLP, and σ is the Gaussian Error Linear Unit (GELU) activation function [42].
The patchwise features processed by the token-mixing MLP, i.e., U, are further transformed by the channel-mixing MLP. The channel-mixing MLP is formulated by (3):

Y = U + σ(U W_3) W_4,  (3)

where W_3 ∈ R^(512×512) and W_4 ∈ R^(512×512) denote the learnable parameters of the channel-mixing MLP. We again use the GELU function as σ in (3).
The output matrix Y contains the patchwise features mixed by MLP-Mixer. Each mixed patchwise feature is duplicated N_s times and then added to each of the input pointwise features. The pointwise features after the addition reflect the relationship among the patches. The sets of features are further folded by the FC layer with the ReLU activation to obtain the output sets of pointwise features. The sets of features output from the MF block are used as inputs to either the subsequent MF block or the second, output-end Folding-only block.
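As a rough sketch, the mixing step of an MF block might look as follows in NumPy. The residual (skip) connections follow the standard MLP-Mixer formulation on which the block is built; weight shapes follow the text, the per-patch folding FC layer that follows the mixing is omitted, and all names are ours.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mf_block_mixing(F, W1, W2, W3, W4):
    """Mixing step of an MF block (weights are assumed to be learned).

    F:  pointwise features, shape (n_h, n_s, d); d = 512 in the paper.
    W1: (n_h, 4*n_h), W2: (4*n_h, n_h)  -- token-mixing MLP weights.
    W3, W4: (d, d)                      -- channel-mixing MLP weights.
    Returns pointwise features of the same shape as F, with the mixed
    patchwise feature broadcast back to every point of its patch."""
    X = F.max(axis=1)              # max-pool each patch: (n_h, d) patchwise tokens
    U = X + W2.T @ gelu(W1.T @ X)  # token-mixing MLP across the n_h patches
    Y = U + gelu(U @ W3) @ W4      # channel-mixing MLP within each patch feature
    return F + Y[:, None, :]       # add the mixed feature to each pointwise feature
```

A subsequent per-patch FC layer with ReLU would then fold these features, as described in the text.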
Postprocessing
We obtain the output 3D point set by taking the union of the N_h sets of pointwise 3D coordinates produced by the second Folding-only block. As we mentioned in Sect. 3.2.1, the number of 3D points produced by HMF-Net, i.e., N_h × N_s, can differ from the desired number of output 3D points (i.e., N_p). When N_h × N_s is larger than N_p, we subsample the generated 3D points by using the Farthest Point Sampling (FPS) technique. We have two reasons for subsampling the 3D point set by FPS. The first reason is to obtain 3D points that distribute uniformly, in terms of density, on the surface of a 3D object. The ablation study conducted in Sect. 4 shows the superiority of FPS over random point sampling. The second reason is to ensure a fair comparison among the decoder DNNs by aligning the numbers of output 3D points.
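A minimal FPS sketch (ours, not the paper's implementation) illustrates why the subsample is spatially uniform: each newly kept point is the one farthest from all points kept so far.

```python
import numpy as np

def farthest_point_sampling(points, n_keep, rng=None):
    """Greedy FPS: repeatedly keep the point whose distance to the
    already-kept set is largest, yielding a spatially uniform subsample."""
    rng = np.random.default_rng() if rng is None else rng
    idx = [int(rng.integers(len(points)))]  # random seed point
    # Distance from every point to the nearest kept point.
    dist = np.linalg.norm(points - points[idx[0]], axis=1)
    for _ in range(n_keep - 1):
        idx.append(int(dist.argmax()))      # farthest from the current kept set
        dist = np.minimum(dist, np.linalg.norm(points - points[idx[-1]], axis=1))
    return points[idx]
```

For HMF-Net's setting, this would reduce the N_h × N_s = 2304 generated points to the desired N_p = 2048.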

Weighted chamfer distance loss
The loss for 3D point set reconstruction is computed between the generated 3D point set P and the groundtruth 3D point set Q. In our case, P corresponds to the output 3D point set subsampled by FPS. Equation (4) shows the Chamfer Distance (CD), which is commonly used as a loss function for 3D point set reconstruction:

d_CD(P, Q) = Σ_{x∈P} min_{y∈Q} d(x, y) + Σ_{y∈Q} min_{x∈P} d(x, y).  (4)

In CD, a corresponding point pair is formed between a 3D point x in one 3D point set and its closest 3D point y in the other 3D point set. The previous studies typically use the Euclidean distance as d(x, y) to measure the positional displacement between x and y.
In this paper, we propose to weight the Euclidean distances so that the DNN training can focus on reconstructing parts having higher point-pair distance values, namely, parts that failed to reconstruct, as exemplified in Fig. 1b. Inspired by the formulation of the Focal loss proposed by Lin et al. [43], we use Equation (5) to weight the Euclidean distance of each 3D point pair (x, y).
In (5), division by two normalizes the Euclidean distance between the two 3D points, whose coordinates fit within a unit sphere, into the range [0, 1]. The exponent γ controls the degree of down-weighting for the distances. The weighted distance d_w(x, y) defined by (5) replaces d(x, y) in (4) to obtain the Weighted Chamfer Distance (WCD) loss. Figure 4 shows the relationship between the normalized Euclidean distance and the loss, i.e., the d_w(x, y) computed by (5). Compared to CD, WCD gives a smaller loss for a small distance and a larger loss for a large distance. We expect that this distance weighting, i.e., the emphasis on larger distances, allows the DNN, via backpropagation, to focus on learning the unsuccessful reconstructions that cause large positional displacements.
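The following NumPy sketch computes the CD in the bidirectional nearest-neighbour style of (4) and shows where a WCD-style weighting plugs in. The particular weighting shown (normalize by two, raise to an exponent γ) is an illustrative assumption in the spirit of the description above, not the paper's exact Eq. (5); all names are ours.

```python
import numpy as np

def chamfer_distance(P, Q, weight=None):
    """Chamfer Distance between 3D point sets P and Q in the style of (4):
    the sum of nearest-neighbour displacements in both directions.
    `weight` optionally remaps each displacement before summation, which is
    where a WCD-style weighting would plug in."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # pairwise distances
    nn_pq = d.min(axis=1)  # for each x in P, distance to its closest y in Q
    nn_qp = d.min(axis=0)  # for each y in Q, distance to its closest x in P
    if weight is not None:
        nn_pq, nn_qp = weight(nn_pq), weight(nn_qp)
    return nn_pq.sum() + nn_qp.sum()

def wcd(P, Q, gamma=2.0):
    """Illustrative weighted variant: normalize each displacement into [0, 1]
    by dividing by two, then raise to gamma > 1 so that small displacements
    are de-emphasized relative to large ones (an assumed form of Eq. (5))."""
    return chamfer_distance(P, Q, weight=lambda d: (d / 2.0) ** gamma)
```

Because the weighting applies per point pair, the WCD variant is a drop-in replacement for the CD in any training loop.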

Experimental setup
We verify the effectiveness of the proposed HMF-Net decoder DNN and the proposed WCD loss function for 3D point set reconstruction. To demonstrate the versatility of the proposed methods, we use three 3D point set reconstruction tasks, namely, shape completion, shape upsampling, and shape reconstruction from multiview 2D images.

Shape completion
We use the dataset created by Tchapmi et al. [3], which is a subset of the ShapeNet corpus [44]. For the task of shape reconstruction from 2D images, we use the renderings provided by [10]. In [10], each polygonal 3D shape is rendered from five viewpoints chosen randomly. We use these five 2D images per shape to form an input set of 2D images. Each 2D image has 137 × 137 colored pixels and each groundtruth 3D point set has 2,048 3D points.
Evaluation measures
In all three reconstruction tasks, the accuracy of 3D point set reconstruction is measured in CD and EMD. A smaller CD/EMD value indicates a more accurate reconstruction since both measure the error, or discrepancy, between the output and groundtruth 3D point sets.
To compute EMD, we use the implementation provided by Qi et al. [16].
The fully connected (FC) decoder [14] and the TopNet decoder [3] transform the latent feature extracted by the encoder DNN to a 3D point set by using an MLP. The 2D deconvolution (2D deconv.) decoder [11], the SONet decoder [33], and the Progressive Seed Generation (PSG) decoder [34] employ a stack of 2D deconvolution layers to generate 3D point sets. The Progressive Deconvolution Generation Net (PDGN) decoder [35] generates 3D point sets by using graph deconvolution layers. The FoldingNet decoder [11], the Point Completion Network (PCN) decoder [2], the AtlasNet decoder [12], and the AtlasNetV2 decoder [13] fold, by using MLP(s), 2D planar patch(es) associated with the latent feature to generate a 3D point set.

Implementation details
For a fair comparison, we use the same encoder DNN, data preprocessing, data augmentation, and hyperparameter settings for all the decoder DNNs we compare. Code for our experiments is implemented in Python with the TensorFlow library [46]. We mainly use a PC having an AMD Ryzen 9 5900X CPU, 64 GByte main memory, and an Nvidia RTX TITAN GPU.
Encoder DNN
For the shape completion task and the shape upsampling task, we use PointNet [16] as the encoder DNN. We omit the T-Net branch in PointNet for normalizing rotation of 3D shapes since orientations of the 3D shapes used in our experiments are consistently aligned. For the task of reconstruction from multiview 2D images, we adopt Multi-View CNN (MVCNN) [47] as the encoder. As a backbone 2D CNN for MVCNN, we utilize ResNet18 [48] pretrained by using the ImageNet1K dataset [49]. For all the above encoder DNNs, the number of dimensions N_d of the latent feature space is fixed at 512. The 512D latent feature vector is normalized by its L2 norm and then fed into the decoder DNN.

Data preprocessing
For each 3D point set, we normalize its position and scale. Specifically, a 3D point set is first translated so that its gravity center coincides with the coordinate origin of the 3D space. The translated 3D point set is then scaled to be enclosed by a unit sphere. Pixels of each 2D image are standardized by using the mean and variance of all the pixel values in the ImageNet1K dataset.
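The point set normalization above can be sketched as follows. This is a minimal NumPy version under the stated conventions, where we assume the gravity center is the mean of the points.

```python
import numpy as np

def normalize_point_set(points):
    """Center a 3D point set at the origin and scale it into the unit sphere.

    points: array of shape (N, 3).
    """
    centered = points - points.mean(axis=0)          # gravity center -> origin
    radius = np.linalg.norm(centered, axis=1).max()  # farthest point from origin
    return centered / radius                          # enclosed by a unit sphere
```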

Data augmentation
To diversify the training data, each preprocessed training sample is augmented before being fed into the encoder DNN. For the shape completion task, we use the data augmentation recommended by [3]. That is, a 3D point set is first randomly rotated about the y-axis, which corresponds to the upright direction of the 3D object. The rotated 3D point set is then mirrored in the xy-plane with a probability of 0.5. For the tasks of shape upsampling and shape reconstruction from 2D images, each 3D point set or 2D image is augmented with a probability of 0.8. To augment 3D point sets, we use anisotropic scaling followed by translation. The scaling factor for each axis is randomly sampled from the uniform distribution U(0.9, 1.1), and the amount of translation along each axis is chosen from U(−0.1, 0.1). A 2D image is augmented by cropping and horizontal flipping. The 2D image is isotropically scaled by a factor of 1.2, and a rectangle having the original image size is cropped at a random position. The cropped image is then horizontally flipped with a probability of 0.5.
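The two point set augmentation schemes above can be sketched as follows; function names and the `rng` parameter are our own conventions for this illustration, not identifiers from the paper's code.

```python
import numpy as np

def augment_completion_sample(points, rng):
    """Augmentation for the shape completion task: random rotation about the
    y-axis (the upright direction), then mirroring in the xy-plane with
    probability 0.5. points: (N, 3) array with columns (x, y, z)."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    points = points @ rot_y.T
    if rng.random() < 0.5:
        points = points * np.array([1.0, 1.0, -1.0])  # mirror in the xy-plane
    return points

def augment_upsampling_sample(points, rng):
    """Augmentation for upsampling / reconstruction from images: with
    probability 0.8, anisotropic scaling in U(0.9, 1.1) per axis followed by
    translation in U(-0.1, 0.1) per axis."""
    if rng.random() < 0.8:
        scale = rng.uniform(0.9, 1.1, size=3)
        shift = rng.uniform(-0.1, 0.1, size=3)
        points = points * scale + shift
    return points
```

Both transforms operate on the whole point set at once, so the correspondence between points is preserved.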
Hyperparameter settings Learnable parameters of the encoder DNNs (except for the ResNet18 backbone) and the decoder DNNs are initialized by using the algorithm by He et al. [50]. We use the Adam optimizer [51] with an initial learning rate of 10^−4. A minibatch consists of 16 samples. In each experiment, we train the DNN for 300 epochs, and the minimum CD/EMD value obtained during the 300 epochs is used as the shape reconstruction error for the experiment. We repeat each experiment three times and report the average shape reconstruction error in the next section. We use CD as the loss function to train the competitor decoders. Our HMF-Net is trained by using either CD or WCD.

We use the following hyperparameters for HMF-Net unless otherwise stated. We set the dimensionality N_a of the ambient space for hyperplanes at 5, the number of MF blocks at 2, and the number of hyperplane patches N_h at 64. For a fair comparison, AtlasNet and AtlasNetV2 in our experiments also employ 64 2D planar patches.

Comparison with existing decoder DNNs
Quantitative evaluation Table 1 compares the shape reconstruction errors of the 11 decoder DNNs in the tasks of completion, upsampling, and reconstruction from 2D images. Tables 2 and 3 show per-category reconstruction errors for the shape completion task. Note that the CD/EMD values shown in the tables are multiplied by 10^3; that is, the actual values are 10^3 times smaller. All the decoder DNNs except for "HMF-Net + WCD" are trained by using the CD loss, and "HMF-Net + WCD" is trained by using the WCD loss with γ = 0.
In most of the cases in Tables 1, 2 and 3, our HMF-Net trained with CD outperforms the existing decoder DNNs. This result verifies the effectiveness of the architecture of the proposed HMF-Net decoder. The impact of each component of HMF-Net on reconstruction accuracy is evaluated in the ablation study conducted in the next subsection. Among the existing decoders we have compared, the patch-folding decoders (i.e., PCN, AtlasNet, and AtlasNetV2), SONet, and the state-of-the-art PSG-Net perform well.
Our full model, i.e., "HMF-Net + WCD", improves upon the shape reconstruction accuracy of HMF-Net trained with CD, as demonstrated in Tables 1, 2 and 3. The WCD loss is effective in reducing reconstruction errors measured both in CD and EMD. WCD's emphasis of larger Euclidean distance values among corresponding 3D point pairs helps the DNN find better solutions for 3D shape reconstruction. Note that the WCD loss can be used as a training loss for the other decoder DNNs. Section 4.2.2 evaluates the effect of WCD on the other decoder DNNs.
Qualitative evaluation Figures 5, 6, and 7 exemplify the results of 3D point set reconstruction in the tasks of shape completion, shape upsampling, and shape reconstruction from 2D images, respectively. Especially in the shape completion task and the upsampling task, our HMF-Net generates 3D point sets having rich local 3D geometry compared to the existing decoder DNNs. For example, in Fig. 5, the wings of an airplane, the wheels of a car, and the legs of chairs generated by HMF-Net have 3D shapes similar to their corresponding parts in the groundtruth 3D point sets. While the global shapes of the 3D point sets generated by the competitors are approximately correct, their local geometrical details are not reconstructed well. Similarly, in Fig. 6, HMF-Net succeeds in upsampling the wings of an airplane and the shelf boards of a bookshelf better than the competitors. For shape reconstruction from 2D images in Fig. 7, all the decoder DNNs, including our HMF-Net, tend to generate "blurry" 3D point sets. This is probably because we chose the challenging experimental setting where the encoder DNN receives only five, inconsistent 2D images rendered from random viewpoints. Yet our HMF-Net, as well as PSG-Net, can generate 3D point sets closer to their groundtruths than FoldingNet and AtlasNet.
Note that HMF-Net, as well as its competing decoders, reconstructs 3D point sets by relying on the latent 3D shape features extracted by the encoder DNN, i.e., PointNet in this experiment. While PointNet is computationally efficient, it is not, by current standards, the most effective at extracting local 3D shape features from the input 3D point set. It is thus worth noting that, despite using PointNet, our HMF-Net successfully reconstructs 3D point sets having accurate local 3D shapes, as exemplified in Figs. 5 and 6. Improving the latent 3D shape feature by using a more advanced encoder DNN (e.g., [17,18,29,30]) would improve the quality of shape reconstruction by HMF-Net. Alternatively, using skip connections between the encoder DNN and the HMF-Net decoder may further improve reconstruction accuracy, since skip connections enrich the information that the decoder receives.
Note, however, that strongly coupling the encoder and the decoder by using skip connections also reduces the versatility of the DNN across 3D point set analysis tasks. For example, the DNNs [4,32,52-54] that employ skip connections are highly dedicated to the shape completion task and are difficult to use for the tasks of upsampling or reconstruction from 2D image(s). We leave utilizing better encoder DNNs and skip connections for future work, since this paper focuses on a decoder DNN and a loss function that can be used for diverse tasks of 3D point set reconstruction. In the in-depth evaluation presented in the next subsection, we compare our HMF-Net with the DNNs designed for shape completion [52-54].

In-depth analysis of proposed algorithm
This subsection evaluates the proposed algorithm from the following four perspectives.
• The impact that the hyperparameters of the proposed algorithm have on the shape reconstruction accuracy.
• The effectiveness of each component of the proposed algorithm, evaluated through an ablation study.
• The generality of the proposed approach, evaluated by applying the hyperplane patch representation and the WCD loss to the existing decoders.
• A comparison with the existing state-of-the-art DNNs designed for 3D point set completion.

Hyperparameters for HMF-Net
We verify the effectiveness of using hyperplane patches. To do so, we investigate the relationship between the dimensionality of the ambient space for patches, i.e., N_a, and the shape reconstruction error. When N_a is 2, each patch is represented as a 1D line embedded in a 2D ambient space. When N_a is 3, each patch is a 2D plane embedded in a 3D ambient space. Such 2D planar patches are employed by the existing patch-folding decoder DNNs. When N_a is equal to or larger than 4, the patches are represented as (N_a − 1)-dimensional hyperplanes. The WCD loss function is used for these experiments. Figure 8a plots shape reconstruction errors measured in CD and EMD against N_a. Although the numerical difference is not very large, hyperplane patches with N_a from 4 to 8 yield shape reconstruction errors lower than N_a = 3 when measured in CD. Similar results are observed in EMD for the shape completion task. These results suggest that the use of hyperplane patches relaxes the constraint that 2D planar patches impose on the patch-folding process, thus improving shape reconstruction results. Figure 8a also shows that using a hyperplane dimensionality higher than the optimum hampers accurate shape reconstruction. We suspect that an excessively high hyperplane dimension overly relaxes the constraint on the patch-folding process, making the decoder less effective at reconstructing smooth surfaces of objects.
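The idea of folding an (N_a − 1)-dimensional hyperplane patch can be sketched as below. This is a purely illustrative toy, not the actual HMF-Net layers: fixing the last ambient coordinate to zero is just one simple way to realize a hyperplane, and the one-layer random-weight map stands in for the learned folding MLP.

```python
import numpy as np

def sample_hyperplane_patch(num_points, n_a, rng):
    """Sample points on an (n_a - 1)-dim hyperplane embedded in n_a-dim space.

    Fixing the last ambient coordinate to zero realizes one hyperplane
    through the origin (a simplification for this sketch).
    """
    patch = rng.uniform(-1.0, 1.0, size=(num_points, n_a))
    patch[:, -1] = 0.0  # constrain the points to the hyperplane
    return patch

def fold_patch(patch, latent, weights):
    """Fold a hyperplane patch into 3D, conditioned on the latent feature.

    Each patch point is concatenated with the shared latent vector and mapped
    to 3D by a toy one-layer map (random weights here; learned in a real
    patch-folding decoder).
    """
    n = patch.shape[0]
    x = np.concatenate([patch, np.tile(latent, (n, 1))], axis=1)
    return np.tanh(x @ weights)  # shape (num_points, 3)
```

With N_a = 3 this degenerates to the 2D planar patches of existing folding decoders; larger N_a gives the folding map more input degrees of freedom per point.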
We next investigate the influence that the number of MF blocks has on the reconstruction results. Figure 8b plots shape reconstruction errors against the number of MF blocks. As shown in the plots, the appropriate number of MF blocks depends on the reconstruction task, and probably also on the dataset. In the shape completion task, the lowest CD is obtained when we use two MF blocks. In the shape upsampling task, more MF blocks are required to achieve the best reconstruction accuracy. The results verify the positive effect that the token-mixing DNN has on learning globally consistent relationships among the hyperplane patches. The effectiveness of using hyperplane patches and MF blocks will also be demonstrated later in the ablation study.
Hyperparameter for WCD The proposed WCD loss has a hyperparameter γ that controls the degree of emphasis of distances among corresponding 3D point pairs. Figure 8c shows the relationship between the value of γ and the shape reconstruction error. When γ lies between 0 and 0.5, training with the WCD loss yields almost the same reconstruction errors. Therefore, we suggest simply setting γ to 0. When γ becomes larger than 0.5, the reconstruction error increases significantly. As shown in Fig. 4, a too large γ (e.g., γ = 1) assigns almost no penalty to small Euclidean distances among corresponding 3D point pairs. With such distance weighting, DNN training for 3D point set reconstruction does not progress effectively.
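The exact weighting function of WCD is defined in Sect. 3 and is not reproduced here. The following NumPy sketch is a purely illustrative distance-weighted Chamfer term with the same qualitative behavior (larger point-pair distances are emphasized, and γ near 1 leaves small distances almost unpenalized); it is not the paper's formulation.

```python
import numpy as np

def weighted_chamfer_sketch(p, q, gamma=0.0):
    """Illustrative distance-weighted Chamfer term (NOT the paper's exact WCD).

    Nearest-neighbor distances are reweighted so that larger point-pair
    distances contribute more; gamma in [0, 1) controls the emphasis
    (gamma = 0 reduces to plain averaging in this sketch).
    """
    d = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)

    def weighted_mean(dists):
        if gamma > 0.0:
            # Weights grow with the distance value; as gamma -> 1 the exponent
            # blows up, so small distances receive almost no penalty.
            w = (dists / (dists.max() + 1e-12)) ** (gamma / (1.0 - gamma))
        else:
            w = np.ones_like(dists)
        return np.sum(w * dists) / (np.sum(w) + 1e-12)

    return weighted_mean(d.min(axis=1)) + weighted_mean(d.min(axis=0))
```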

Ablation study
We conduct an ablation study to evaluate the influence that each component of the proposed method has on the shape reconstruction accuracy. We use the shape completion task for this ablation study. In it, we evaluate four components, namely, the input patch representation, the layer for patch processing, the subsampling method applied to the output 3D point set, and the loss function. Table 4 shows the effectiveness of each component. In the column "Patch processing layer" of Table 4, model A, "F-MF-MF-F", is the proposed patch processing method as described in Sect. 3. Here, two "MF"s, or MF blocks, are sandwiched between two "F"s, or Folding-only blocks. The degenerate model F, designated by "F-F-F-F", corresponds to AtlasNet [12]. It consists of four Folding-only ("F") blocks without MF blocks. Compared to model A in Table 4, ablating each of the components (i.e., models B to E) increases both CD and EMD, i.e., yields less accurate reconstructions, and ablating all the components as in model F shows the worst result in the table. These results verify the positive impact of the proposed components on the reconstruction results. Figure 9 visually evaluates the effectiveness of our main proposals in this paper, namely, the HMF-Net decoder and the WCD loss. As shown in the figure, ablating either the HMF-Net decoder or the WCD loss deteriorates the quality of shape completion. For example, in the case of the lamp shape, ablating one of the proposed components leads to inaccurate reconstruction of the lampshade part and the hanging cord part. As for the table shape in Fig. 9, our full model "HMF-Net + WCD" succeeds, to a certain extent, in reconstructing the leg opposite to the one included in the input point set, while the ablated methods fail.

Effect of the proposed ideas on existing decoders
The idea of using hyperplane patches as input to 3D point set decoders is applicable to the other patch-folding decoder DNNs. Also, the proposed WCD loss can potentially be used to train any DNN for 3D point set reconstruction. To demonstrate the generality of these proposed approaches, we apply them to the existing DNN decoders. As in the experiments above, shape reconstruction accuracy is evaluated by using the completion task, and we use N_a = 5 for the hyperplane patches and γ = 0 for the WCD loss. Table 5 shows how our proposed approaches influence the shape completion accuracy of the existing decoders. For all four decoders we have experimented with, CD and EMD decrease by using our proposed input patch representation and/or loss function. Considering the results in Tables 4 and 5, our ideas, i.e., folding hyperplane patches and weighting point pair distances, are advantageous in helping the DNNs find better solutions for 3D point set reconstruction.
Comparison with the state-of-the-art DNNs for shape completion Table 6 compares our algorithm against the recently proposed algorithms for the shape completion task, i.e., PMP-Net [52], SnowflakeNet [53], and PoinTr [54]. As an overall tendency, Table 6 shows that the state-of-the-art DNNs outperform our algorithm. This result is not surprising, since these completion-oriented DNNs employ mechanisms designed specifically for accurate shape completion. That is, all the DNNs [52-54] have skip connections between an encoder DNN and a decoder DNN, and utilize an advanced encoder DNN that is adept at extracting local 3D geometric features.
The high accuracies of these completion-oriented DNNs imply that our algorithm still has the potential to improve shape reconstruction accuracy by introducing, for example, skip-connections and/or an advanced encoder DNN.

Conclusion and future work
This paper proposed a novel 3D point set decoder and a novel loss function for deep learning-based 3D point set reconstruction. Our 3D point set decoder DNN, called Hyperplane Mixing and Folding Net (HMF-Net), employs hyperplane patches as its input representation. The use of hyperplanes relaxes the exceedingly strong constraint that planar (2D) input patches impose on the patch-folding process. Consistent global relationships among multiple hyperplane patches are effectively learned by a stack of Mixing and Folding blocks, which are built upon the token-mixing DNN [15]. We also introduced the Weighted Chamfer Distance (WCD) loss to help our DNN find a better shape reconstruction solution that pays attention to local shape details. We comprehensively evaluated the proposed methods under three 3D point set reconstruction scenarios, namely, shape completion, shape upsampling, and shape reconstruction from multiview 2D images. The experimental results demonstrated that both HMF-Net and WCD have positive impacts on improving the accuracy of shape reconstruction. Future work includes further improving the accuracy of 3D point set reconstruction. This would require improvements to the entire learning framework for 3D point set reconstruction, including, but not limited to, training dataset acquisition, DNN architecture, and loss function. As discussed in Sect. 4.2.1, using a more advanced encoder DNN (e.g., [17,18,29,30]) to obtain better latent 3D shape features, or introducing skip connections between the encoder and the decoder, would be possible approaches. Regarding the loss function, applying our idea of weighting distances to other set-to-set distances (e.g., Earth Mover's Distance) has the potential to further improve shape reconstruction accuracy.

Fig. 1
Fig. 1 Motivation for our proposed methods. a The existing patch-folding decoders often fail to reconstruct detailed local 3D geometry and tend to produce "blurred", or smoothly distributed, 3D points. We suspect that such "blurred" reconstruction stems from using low-dimensional (2D) input patches. b The Chamfer Distance loss is dominated by displacements of 3D points in successfully reconstructed parts (circled in green), whose geometry is stable across data samples. Local geometry having high variability across data samples, which tends to have large displacements, is almost ignored in the loss

Fig. 2
Fig. 2 Overview of the proposed algorithm for 3D point set reconstruction. The Hyperplane Mixing and Folding Net (HMF-Net) decoder leverages hyperplane patches to generate diverse local shapes having rich 3D

Fig. 3 Fig. 4
Fig. 3 Processing pipeline of the Mixing and Folding (MF) block, which is the core building block of our HMF-Net. The MLP-Mixer enables the patchwise features to interact with each other so that a globally consistent set of patchwise features can be obtained. The FC layer further

Fig. 5 Fig. 8
Fig. 5 Examples of 3D point set reconstruction in the shape completion task. Compared to the competitor decoders, our HMF-Net decoder generates 3D point sets with accurate local 3D geometry

Fig. 9
Fig. 9 Visualization of the effectiveness of the HMF-Net decoder and the WCD loss. Compared to our full model "HMF-Net + WCD", disabling one of its components deteriorates the quality of shape completion results

Table 1
Comparison. The CD and EMD values in the table are multiplied by 10^3

Table 2
Per-category shape completion errors measured in CD

Table 3
Per-category shape completion errors measured in EMD. The EMD values in the table are multiplied by 10^3

Table 4
Ablation study of the proposed method. The components adopted by our proposed method are colored in red; CD and EMD are evaluated in the shape completion task


Table 5
Effectiveness of the proposed approaches when applied to the existing decoder DNNs. The CD and EMD values are evaluated in the shape completion task; our proposed components are colored in red

Table 6
Comparison of the proposed algorithm and the existing state-of-the-art DNNs designed for the shape completion task