A survey on deep geometry learning: From a representation perspective

Researchers have achieved great success in dealing with 2D images using deep learning. In recent years, 3D computer vision and geometry deep learning have gained ever more attention. Many advanced techniques for 3D shapes have been proposed for different applications. Unlike 2D images, which can be uniformly represented by a regular grid of pixels, 3D shapes have various representations, such as depth images, multi-view images, voxels, point clouds, meshes, implicit surfaces, etc. The performance achieved in different applications largely depends on the representation used, and there is no unique representation that works well for all applications. Therefore, in this survey, we review recent developments in deep learning for 3D geometry from a representation perspective, summarizing the advantages and disadvantages of different representations for different applications. We also present existing datasets in these representations and further discuss future research directions.

This version is being made available in accordance with publisher policies. See http://orca.cf.ac.uk/policies.html for usage policies. Copyright and moral rights for publications made available in ORCA are retained by the copyright holders.

Background
Recent improvements in methods for acquisition and rendering of 3D models have resulted in consolidated repositories on the Internet containing huge numbers of 3D shapes. With the increased availability of 3D models, we have been seeing an explosion in demand for processing, generation, and visualization of 3D models in a variety of disciplines, such as medicine, architecture, and entertainment. Techniques for matching, identification, and manipulation of 3D shapes have become fundamental building blocks in modern computer vision and computer graphics systems. Due to the complexity and irregularity of 3D shape data, effectively representing 3D shapes remains a challenging problem. Thus, there have been extensive research efforts concentrating on how to deal with and generate 3D shapes in different representations.
In early research on 3D shape representations, 3D objects were normally modeled with a global approach, such as constructive solid geometry and deformed superquadrics. Those approaches have several drawbacks when utilized for tasks like recognition and retrieval. Firstly, when representing imperfect 3D shapes, including those with noise and incompleteness, which are common in practice, such representations may have a negative influence on matching performance. Secondly, the high dimensionality heavily burdens the computation and tends to make models overfit. Hence, more sophisticated methods are designed to extract representations of 3D shapes in a more concise, yet discriminative and informative form.
Several related surveys have been published [1][2][3], which focus on different aspects of deep learning for 3D geometry. Moreover, with rapid development of 3D shape representations and related techniques for deep learning, it is essential to further summarize up-to-date research. In this survey, we mainly review deep learning methods on 3D shape representations and discuss their advantages and disadvantages in different application scenarios. We now give a brief summary of different 3D shape representation categories.

Depth and multi-view images
Depth and multi-view images can be used to represent 3D models over a 2D field; the regular structure of images makes for efficient processing. Depending on whether depth is included, 3D shapes can be represented by RGB (color) or RGB-D (color and depth) images viewed from different viewpoints. Because of the influx of available depth data due to the popularity of 2.5D sensors, such as Microsoft Kinect, Intel RealSense, etc., multi-view RGB-D images are widely used to represent real-world 3D shapes. Large numbers of image-based models are available in this representation, but it is inevitable that such representations lose some geometric detail.

Voxels
A voxel is a 3D extension of the concept of pixel. Like pixels in 2D, the voxel-based representation also has a regular structure in 3D space. Architectures of various neural networks which have proved useful in the 2D image field [4,5] can be easily extended to voxel form. Nevertheless, adding one dimension means an exponential increase in data size. As resolution increases, the memory required and computational costs increase dramatically, which restricts the representation to low resolutions when representing 3D shapes.
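To make the growth concrete, the cell count of a cubic occupancy grid scales with the cube of the resolution, so doubling the resolution multiplies memory by eight. A small illustrative sketch (`voxel_count` is a hypothetical helper, not from any cited work):

```python
def voxel_count(resolution):
    """Number of cells in a cubic occupancy grid at the given resolution."""
    return resolution ** 3

# At one byte per voxel, a 32^3 grid needs ~32 KB, while a 256^3 grid
# already needs ~16 MB: a 512x increase for only 8x finer resolution.
low = voxel_count(32)    # 32,768 cells
high = voxel_count(256)  # 16,777,216 cells
```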

Surfaces
Surface-based representations describe 3D shapes by encoding their surfaces, which can also be regarded as 2-manifolds. Point clouds and meshes are both discretized forms of 3D shape surfaces. Point clouds use a set of sampled 3D point coordinates to represent the surface. They can easily be generated by scanners but are difficult to process due to their lack of order and connectivity information. Researchers use order-invariant operators such as the max pooling operator in deep neural networks [6,7] to mitigate the lack of order. Meshes can depict higher quality 3D shapes with less memory and computational cost compared to point clouds and voxels. A mesh contains a vertex set and an edge set. Due to its graphical nature, researchers have made attempts to build graph-based convolutional neural networks for coping with meshes. Some other methods regard meshes as the discretization of 2-manifolds. Moreover, meshes are more suitable for 3D shape deformation. One can deform a mesh model by transforming vertices while simultaneously retaining the connectivity.

Implicit surfaces
Implicit surface representation exploits implicit field functions, such as occupancy functions [8] and signed distance functions [9], to describe the surface of 3D shapes. The implicit functions learned by deep neural networks define the spatial relationship between points and surfaces. They provide a description with infinite resolution for 3D shapes with reasonable memory consumption, and are capable of representing shapes with changing topology. Nevertheless, implicit representations cannot reflect the geometric features of 3D shapes directly, and usually need to be transformed to explicit representations such as meshes. Most methods apply iso-surfacing, such as marching cubes [10], which is an expensive operation.

Structured representation
One way to cope with complex 3D shapes is to decompose them into structure and geometric details, leading to structured representations. Recently, increasing numbers of methods regard a 3D shape as a collection of parts and organize them linearly or hierarchically. The structure of 3D shapes is processed by recurrent neural networks (RNNs) [11], recursive neural networks (RvNNs) [12], or other network architectures. Each part of the shape can be processed by unstructured models. The structured representation focuses on the relations (such as symmetry, supporting, being supported, etc.) between different parts within a 3D shape, which provides a better descriptive capability than alternative representations.

Deformation-based representation
As well as rigid man-made 3D shapes such as chairs and tables, there are also a large number of non-rigid (e.g., articulated) 3D shapes such as human bodies, which also play an important role in computer animation, augmented reality, etc. Deformation-based representation is used mainly to describe intrinsic deformation properties while ignoring extrinsic transformation properties. Many methods use rotation-invariant local features for describing shape deformation to reduce distortion while retaining geometric details.

Geometry learning
Recently, deep learning has achieved superior performance to classical methods in many fields, including 3D shape analysis, reconstruction, etc. A variety of architectures of deep networks have been designed to process or generate 3D shape representations, which we refer to as geometry learning. In the following sections, we focus on the most recent deep learning based methods for representing and processing 3D shapes in different forms. Based on how the representation is encoded and stored, our survey is organized around the following structure: Section 2 reviews image-based shape representation methods. Sections 3 and 4 introduce voxel- and surface-based representations respectively. Section 5 further introduces implicit surface representations. Sections 6 and 7 review structure- and deformation-based description methods. We then summarize typical datasets in Section 8 and typical applications for shape analysis and reconstruction in Section 9, before concluding the paper in Section 10. Figure 1 provides a timeline of representative deep learning methods based on various 3D shape representations.

Image-based representations
2D images are projections of 3D entities. Although the geometric information carried by one image is incomplete, a plausible 3D shape can be inferred from a set of images with different perspectives.
The extra channel of depth in RGB-D data further enhances the capacity of image-based representations to encode geometric cues. Benefiting from the image-like structure, research using deep neural networks for 3D shape inference from images started earlier than alternative representations that explicitly depict the surface or geometry of 3D shapes.
Socher et al. [33] proposed a convolutional and recursive neural network for 3D object recognition, which copes with RGB and depth images using single convolutional layers separately and merges the features with a recursive network. Eigen et al. [16] first proposed reconstructing a depth map from a single RGB image and designed a new scale-invariant loss for the training stage. Gupta et al. [34] encoded the depth map into three channels including disparity, height, and angle. Other deep learning methods based on RGB-D images designed for 3D object detection [35,36] outperform previous methods.
Images from different viewpoints can provide complementary cues to infer 3D objects. Thanks to the development of 2D deep learning models, learning methods based on multi-view image representation perform better for 3D shape recognition than those based on other 3D representations. Su et al. [14] proposed MVCNN (multi-view convolutional neural network) for 3D object recognition. It processes the images for different views separately in the first part of the CNN, then aggregates the features extracted from different views by view-pooling layers, and finally sends the merged features to the remainder of the CNN. Qi et al. [37] proposed adding a multi-resolution strategy to MVCNN for higher classification accuracy.

Dense voxel representation
The voxel-based representation is traditionally a dense representation, which describes 3D shape data by a volumetric grid in 3D space. Each voxel in a cuboid grid records occupancy status (i.e., occupied or unoccupied).
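As a minimal illustration (not drawn from any cited method), a point set can be rasterized into such a binary occupancy grid as follows; `voxelize` is a hypothetical helper that assumes the points already lie in the unit cube:

```python
import numpy as np

def voxelize(points, resolution=32):
    """Convert an (N, 3) point set in [0, 1]^3 into a binary occupancy grid."""
    grid = np.zeros((resolution,) * 3, dtype=bool)
    # Map each coordinate to a cell index, clamping points on the far boundary.
    idx = np.clip((points * resolution).astype(int), 0, resolution - 1)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid
```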
One of the earliest methods to apply deep neural networks to volumetric representations, 3D ShapeNets, was proposed by Wu et al. [13] in 2015. They assigned three different states to the voxels in the volumetric representation produced by 2.5D depth maps: observed, unobserved, and free.
3D ShapeNets extended the deep belief network (DBN) [38] from pixel data to voxel data and replaced fully connected layers in the DBN with convolutional layers. The model takes the aforementioned volumetric representation as input, and outputs category labels and predicted 3D shape by iterative computations. Concurrently, Maturana et al. proposed processing a volumetric representation with 3D convolutional neural networks (3D CNNs) [39] and designed VoxNet [40] for object recognition. VoxNet defines several volumetric layers, including an input layer, convolutional layers, pooling layers, and fully connected layers. Although these layers simply extend traditional 2D CNNs [4] to 3D, VoxNet is easy to implement and train, and gets promising performance as the first attempt at volumetric convolution. In addition, to ensure that VoxNet is invariant to orientation, Maturana et al. augmented the input data by rotating each shape into n instances with different orientations during training, and added a pooling operation after the output layer to group all predictions from the n instances during testing.
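The volumetric layers above rest on the 3D analogue of image convolution. A naive dense sketch in NumPy (illustration only; real implementations batch the operation and add channel dimensions):

```python
import numpy as np

def conv3d(volume, kernel):
    """Naive valid-mode 3D convolution (cross-correlation), the core
    operation that volumetric CNNs such as VoxNet extend from 2D to 3D."""
    k = kernel.shape[0]
    out_shape = tuple(s - k + 1 for s in volume.shape)
    out = np.zeros(out_shape)
    for i in range(out_shape[0]):
        for j in range(out_shape[1]):
            for l in range(out_shape[2]):
                # Sliding cubic window: elementwise product, then sum.
                out[i, j, l] = (volume[i:i+k, j:j+k, l:l+k] * kernel).sum()
    return out
```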
In addition to the development of deep belief networks and convolutional neural networks for shape analysis based on volumetric representation, two of the most successful generative models, namely auto-encoders and generative adversarial networks (GANs) [41], have also been extended to support this representation. Inspired by denoising auto-encoders (DAEs) [42,43], Sharma et al. [44] proposed an auto-encoder model, VConv-DAE, to cope with voxels. It is one of the earliest unsupervised learning approaches for voxel-based shape analysis. Without object labels for training, VConv-DAE chooses mean square loss or cross entropy loss as the reconstruction loss function. Girdhar et al. [45] also proposed the TL-embedding network, which combines an auto-encoder for generating a voxel-based representation with a convolutional neural network for predicting the embedding from 2D images.
Choy et al. [18] proposed 3D-R2N2 which takes single or multiple images as input and reconstructs objects within an occupancy grid. 3D-R2N2 regards input images as a sequence; its 3D recurrent neural network is based on LSTM (long short-term memory) [46] or GRU (gated recurrent units) [47]. The architecture consists of three parts: an image encoder to extract features from 2D images, 3D-LSTM to predict hidden states as coarse representations of final 3D models, and a decoder to increase the resolution and generate target shapes.
Wu et al. [19] designed a generative model called 3D-GAN that applies a generative adversarial network (GAN) [41] to voxel data. 3D-GAN learns to synthesize a 3D object from a sampled latent space vector z with probability distribution P(z). Moreover, Ref. [19] also proposed 3D-VAE-GAN inspired by VAE-GAN [48] for the object reconstruction task. 3D-VAE-GAN puts the encoder before 3D-GAN to infer the latent vector z from input 2D images, and shares the decoder with the generator of 3D-GAN.
After early attempts to use volumetric representations with deep learning, researchers began to optimize the architecture of volumetric networks for better performance and more applications. The motivation is that a naive extension from traditional 2D networks often does not perform better than image-based CNNs such as MVCNN [14]. The main challenges affecting performance include overfitting, orientation, data sparsity, and low resolution.
Qi et al. [37] proposed two new network structures aiming to improve the performance of volumetric CNNs. One introduces an extra task, predicting class labels for subvolumes to prevent overfitting, and the other utilizes elongated kernels to compress the 3D information into 2D in order to use 2D CNNs directly. Both use mlpconv layers [49] to replace traditional convolutional layers. Ref. [37] also augments the input data using different orientations and elevations to encourage the network to obtain more local features in different poses so that the results are less influenced by orientation changes. To further mitigate the impact of orientation on recognition accuracy, instead of using data augmentation like Refs. [37,40], Ref. [50] proposed a new model called ORION which extends VoxNet [40] and uses a fully connected layer to predict the object class label and orientation label simultaneously.

Sparse voxel representation (octree)
Voxel-based representations often lead to high computational cost because of the exponential increase in computations from pixels to voxels. Most methods cannot cope with or generate high-resolution models within a reasonable time. For instance, the TL-embedding network [45] was designed for a 20³ voxel grid; 3D ShapeNets [13] and VConv-DAE [44] were designed for a 24³ voxel grid with 3 voxels padding in each direction; VoxNet [40], 3D-R2N2 [18], and ORION [50] were designed for a 32³ voxel grid; 3D-GAN was designed to generate a 64³ occupancy grid as a 3D shape representation. As the voxel resolution increases, the occupied voxels become sparser in the 3D space, which leads to more unnecessary computation. To address this problem, Li et al. [51] designed a novel method called FPNN to cope with data sparsity.
Some methods instead encode the voxel grid using a sparse, adaptive data structure, the octree [52], to reduce the dimensionality of the input data. Häne et al. [53] proposed hierarchical surface prediction (HSP) which can generate a voxel grid in the form of an octree from coarse to fine. Häne et al. observed that only the voxels near the object surface need to be predicted at high resolution, allowing the proposed HSP to avoid unnecessary calculation for affordable generation of a high-resolution voxel grid. Each node in the octree is defined as a voxel block with a fixed number (16³ in the paper) of voxels of different sizes, and each voxel block is classified as occupied, boundary, or free. The decoder of the model takes a feature vector as input, and predicts feature blocks that correspond to voxel blocks hierarchically. HSP uses an octree with 5 levels, each voxel block containing 16³ voxels, so it can generate a grid of up to 256³ voxels. Tatarchenko et al. [54] proposed a decoder called OGN for generating high-resolution volumetric representations. Nodes in the octree are separated into three categories: empty, full, or mixed. The octree representing a 3D model and the feature map of the octree are stored in the form of hash tables indexed by spatial position and octree level. In order to process feature maps represented as hash tables, Tatarchenko et al. designed a convolutional layer named OGN-Conv, which converts the convolutional operation into matrix multiplication. Ref. [54] generates octree cells of different resolutions in each decoder layer by convolutional operations on feature maps, and then decides whether to propagate the features to the next layer according to the label (propagating features for cells labeled "mixed", and stopping propagation for "empty" and "full" cells).
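The coarse-to-fine idea behind these octree generators can be sketched as a toy recursion over a dense occupancy grid that subdivides only blocks classified as boundary (`classify_block` and `refine` are hypothetical helpers, not code from HSP or OGN):

```python
import numpy as np

def classify_block(grid, x0, y0, z0, size):
    """Label a cubic block of a dense occupancy grid as 'free', 'occupied',
    or 'boundary' (mixed), mirroring the three block labels used by HSP."""
    block = grid[x0:x0+size, y0:y0+size, z0:z0+size]
    if not block.any():
        return "free"
    if block.all():
        return "occupied"
    return "boundary"

def refine(grid, x0=0, y0=0, z0=0, size=None, min_size=1):
    """Recursively subdivide only 'boundary' blocks, returning leaf blocks
    as (x0, y0, z0, size, label) tuples."""
    size = size or grid.shape[0]
    label = classify_block(grid, x0, y0, z0, size)
    if label != "boundary" or size <= min_size:
        return [(x0, y0, z0, size, label)]
    h = size // 2
    leaves = []
    for dx in (0, h):
        for dy in (0, h):
            for dz in (0, h):
                leaves += refine(grid, x0+dx, y0+dy, z0+dz, h, min_size)
    return leaves
```

Only blocks straddling the surface are refined, so the leaf count grows with the surface area rather than the volume of the grid.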
Besides decoder model design for synthesizing voxel grids, shape analysis methods have also been designed using octrees. However, it is difficult to use the conventional octree structure [52] in deep networks. Many researchers have tried to resolve the problem by designing new structures for octrees, and special operations such as convolution, pooling, and unpooling on octrees. Riegler et al. [21] proposed OctNet. Its octree representation has a more regular structure than a traditional octree: it places a shallow octree in each cell of a regular 3D grid. Each shallow octree can have up to 3 levels and is encoded in 73 bits, where each bit determines whether the corresponding cell needs to be split. Wang et al. [22] also proposed an octree-based convolutional neural network called O-CNN, which, like the shallow octree [21], removes pointers, and stores the octree data and structure using a series of vectors, including shuffle key vectors, labels, and input signals.
Instead of representing voxels, octree structure can also be utilized to represent 3D surfaces with planar patches. Wang et al. [55] proposed adaptive O-CNN, based on a patch-guided adaptive octree, which divides a 3D surface into a set of planar patches restricted by bounding boxes corresponding to octants. They also provided an encoder and a decoder for the octree defined by this paper.

Initial work
The typical point-based representation is also referred to as a point cloud or point set. It can be raw data generated by a 3D scanning device. Because of its unordered and irregular structure, this kind of representation is relatively difficult to cope with using traditional deep learning methods. Therefore, most researchers avoided directly using point clouds in the early stages of deep learning-based geometry research. One of the first models to generate point clouds by deep learning came out in 2017 [20]. The authors designed a neural network to learn a point sampler based on a 3D point distribution. The network takes a single image and a random vector as input, and outputs an N × 3 matrix representing a predicted point set (x, y, z coordinates for N points). Chamfer distance (CD) and earth mover's distance (EMD) [56] were used as loss functions to train the networks.
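Chamfer distance, one of the two losses mentioned, has several variants; one common symmetric form averages nearest-neighbour distances in both directions (a sketch; some papers use squared distances instead):

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean distance from each point to its nearest neighbour in the other set."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```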

PointNet
At almost the same time, Charles et al. [6] proposed PointNet for shape analysis, which was the first successful deep network architecture to directly process point clouds without unnecessary rendering. Its pipeline is illustrated in Fig. 2. Taking account of three properties of point sets mentioned in Ref. [6], PointNet has three components in its network: max-pooling layers as symmetric functions for dealing with lack of ordering, concatenation of global and local features for point interaction, and a joint alignment network for transformation invariance. Based on PointNet, Qi et al. further improved this model in PointNet++ [7], overcoming the problem that PointNet cannot well capture local features induced by the metric. In comparison to PointNet, PointNet++ introduces a hierarchical structure, allowing it to capture features at different scales, improving its ability to extract 3D shape features. As PointNet and PointNet++ showed state-of-the-art performance for shape classification and semantic segmentation, more and more deep learning models were proposed based on point-based representations.
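The key idea of using max pooling as a symmetric function can be shown in a few lines: a shared per-point transform followed by a max over points yields the same global feature for any ordering of the input. This toy single-layer stand-in for PointNet's shared MLP is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))  # a toy shared per-point linear layer

def global_feature(points):
    """PointNet-style encoder sketch: a shared per-point transform followed
    by max pooling, which makes the output independent of point order."""
    per_point = np.maximum(points @ W, 0.0)  # shared ReLU layer, (N, 16)
    return per_point.max(axis=0)             # order-invariant pooling, (16,)
```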

CNNs for point clouds
Some research works focus on applying CNNs to analysis of irregular and unordered point clouds. Li et al. [24] proposed PointCNN for point clouds and designed the X-transformation to weight and permute the input point features, guaranteeing equivariance for different point orders. Each feature matrix must be multiplied by the X-transformation matrix before passing through the convolutional operator. This process is called the X-Conv operator, which is the key element of PointCNN. Wang et al. [57] proposed DGCNN, a dynamic graph CNN architecture for point cloud classification and segmentation. Instead of processing point features like PointNet [6], DGCNN first connects neighboring points in spatial or semantic space to generate a graph, and then captures local geometric features by applying the EdgeConv operator to it. Moreover, unlike other graph CNNs which process a fixed input graph, DGCNN changes the graph to obtain new nearest neighbors in feature space in different layers, which is beneficial in providing larger and sparser receptive fields.
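The EdgeConv operator at the heart of DGCNN can be sketched as follows: build a k-nearest-neighbour graph, form edge features from each point and its offsets to neighbours, apply a shared linear map, and max-pool over the neighbourhood (`knn` and `edge_conv` are illustrative helpers, not DGCNN's implementation):

```python
import numpy as np

def knn(points, k):
    """Indices of the k nearest neighbours of each point (excluding itself)."""
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def edge_conv(points, k, W):
    """EdgeConv-style layer sketch: for each point, build edge features
    [x_i, x_j - x_i] to its k neighbours, apply a shared linear map W (6 x out),
    and max-pool over the neighbourhood."""
    nbrs = knn(points, k)                                # (N, k)
    xi = np.repeat(points[:, None, :], k, axis=1)        # (N, k, 3) centres
    xj = points[nbrs]                                    # (N, k, 3) neighbours
    edge_feats = np.concatenate([xi, xj - xi], axis=-1)  # (N, k, 6)
    return np.maximum(edge_feats @ W, 0.0).max(axis=1)   # (N, out)
```

Recomputing `knn` on intermediate features rather than raw coordinates gives the "dynamic graph" behaviour described above.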

Other point cloud processing techniques using NNs

Klokov et al. [58] proposed the Kd-network to process point clouds based on the form of k-d trees. Yang et al. [59] proposed FoldingNet, an end-to-end auto-encoder for further compressing a point-based representation with unsupervised learning. Because point clouds can be transformed into a 2D grid by folding operations, FoldingNet integrates folding operations in its encoder-decoder to recover input 3D shapes. Mehr et al. [60] further proposed DiscoNet for 3D model editing by combining multiple autoencoders specifically trained for different types of 3D shapes. The autoencoders use pre-learned mean geometry of 3D training shapes as their templates. Meng et al. [61] proposed VV-Net (voxel VAE net) for point segmentation; it represents a point cloud by a structured voxel representation. Instead of using a Boolean value to represent occupancy of each voxel as in a normal volumetric representation, it uses a latent code computed by an RBF-VAE, a variational autoencoder based on radial basis function (RBF) interpolation of points, to describe point distribution within a voxel. This representation is used to extract intrinsic symmetry of point clouds using a group equivariant CNN, and the output is combined with PointNet [6] for better segmentation performance.

Observations
Although point-based representation can be more easily obtained from 3D scanners than other 3D representations, this raw form of 3D shape is typically unsuitable for 3D shape analysis, due to noise and data sparsity.
Therefore, unlike other representations, it is essential for methods using point-based representation to incorporate an upsampling module to obtain fine-grained point clouds: see PU-Net [62], MPU [63], PU-GAN [64], etc. Additionally, point cloud registration is also an essential preprocessing step to fuse points from multiple scans: it aims to calculate rigid transformation parameters to align the point clouds. Wang et al. [65] proposed deep closest point (DCP), which extends the traditional iterative closest point (ICP) method [66], using a deep learning method to obtain the transformation parameters. Recently, Guo et al. [3] presented a survey focusing on deep learning models for point clouds, which provides more details in this field.

Mesh-based representations
Unlike point-based representations, mesh-based representations provide connectivity between neighboring points, so are more suitable for describing local regions on surfaces. As a typical type of representation in non-Euclidean space, mesh-based representations can be processed by deep learning models both in spatial and spectral domains [1].

Parametric representations for meshes
Directly applying CNNs to irregular data structures like meshes is non-trivial. A handful of approaches have emerged that map 3D shape surfaces to 2D domains, such as 2D geometry images, which can also be regarded as another 3D shape representation, and then apply traditional 2D CNNs to them [67,68]. Based on geometry images, Sinha et al. [69] proposed SurfNet for shape generation using a deep residual network. Similarly, Shi et al. [70] projected 3D models into cylindrical panoramic images, which are then processed by CNNs. Other methods convert mesh models into spherical signals, using a convolutional operator in the spherical domain for shape analysis. To address high-resolution signals on 3D meshes, in particular texture information, Huang et al. [71] proposed TextureNet to extract features, using a 4-rotationally symmetric (4-RoSy) field to parameterize surfaces. In the following, we review deep learning models according to how meshes are directly treated as input, and introduce generative models working on meshes.

Graphs
The mesh-based representation is constructed from sets of vertices and edges, and can be seen as a graph. Some models have been proposed based on spectral graph theory. They generalize CNNs to graphs [72][73][74][75][76] via eigen-decomposition of Laplacian matrices, generalizing convolutional operators to the spectral domain of graphs. Verma et al. [77] proposed another graph-based CNN, FeaStNet, which computes the receptive fields of the convolution operator dynamically. Specifically, it determines the assignment of neighborhood vertices using features obtained from the networks. Hanocka et al. [29] also designed operators for convolution, pooling, and unpooling for triangle meshes, and proposed MeshCNN. Unlike other graph-based methods, it focuses on processing features stored in edges, using a convolution operator applied to the edges with a fixed number of neighbors and a pooling operator based on edge collapse. MeshCNN extracts 3D shape features with respect to specific tasks, and learns to preserve important features and ignore unimportant ones.
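Spectral methods of this kind filter vertex signals in the eigenbasis of the graph Laplacian. A minimal sketch (dense eigendecomposition, so only practical for small graphs; `g` is the learnable spectral filter):

```python
import numpy as np

def normalized_laplacian(adj):
    """Symmetric normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    return np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt

def spectral_filter(adj, signal, g):
    """Filter a per-vertex signal in the Laplacian eigenbasis: U g(lam) U^T x."""
    lam, U = np.linalg.eigh(normalized_laplacian(adj))
    return U @ (g(lam) * (U.T @ signal))
```

With `g` returning all ones, the filter is the identity; learnable parameterizations of `g` yield spectral graph convolutions.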

2-Manifolds
The mesh-based representation can be viewed as a discretization of a 2-manifold. Several works have been designed using 2-manifolds with a series of refined CNN operators adapted to such non-Euclidean spaces. These methods define their own local patches and kernel functions when generalizing CNN models. Masci et al. [15] proposed geodesic convolutional neural networks (GCNNs) for manifolds, which extract and discretize local geodesic patches and apply convolutional filters to these patches in polar coordinates. The convolution operator works in the spatial domain and their geodesic CNN is quite similar to conventional CNNs applied in Euclidean space. Localized spectral CNNs [78] proposed by Boscaini et al. apply windowed Fourier transforms in non-Euclidean space. Anisotropic convolutional neural networks (ACNNs) [79] use an anisotropic heat kernel to replace the isotropic patch operator in GCNN [15], giving another solution to avoid ambiguity. Xu et al. [80] proposed directionally convolutional networks (DCNs), which define local patches based on faces of the mesh representation. They also designed a two-stream network for 3D shape segmentation, which takes local face normals and the global face distance histogram as training input. Monti et al. [81] proposed MoNet, which replaces the weight functions in Refs. [15,79] with Gaussian kernels with learnable parameters. Fey et al. [82] proposed SplineCNN which uses a convolutional operator based on B-splines. Pan et al. [83] designed a surface CNN for irregular 3D surfaces; it preserves the standard CNN property of translation equivariance by using parallel translation frames and group convolutional operations. Qiao et al. [84] proposed the Laplacian pooling network (LaplacianNet) for 3D mesh analysis. It considers both spectral and spatial information from the mesh, and contains 3 parts: preprocessing of features as network input, mesh pooling blocks to split the surface and cluster patches for feature extraction, and a correlation network to aggregate global information.

Generative models
There are also many generative models for mesh-based representation. Wang et al. [23] proposed Pixel2Mesh for reconstructing 3D shapes from single images; it generates the target triangular mesh by deforming an ellipsoidal template. As shown in Fig. 3, the Pixel2Mesh network is implemented based on graph-based convolutional networks (GCNs) [1] and generates the target mesh from coarse to fine by an unpooling operation. Wen et al. [85] advanced Pixel2Mesh and proposed Pixel2Mesh++, which extends single-image 3D shape reconstruction to 3D shape reconstruction from multi-view images. To do so, Pixel2Mesh++ introduces a multi-view deformation network (MDN) to the original Pixel2Mesh; it incorporates cross-view information in the process of mesh generation. Groueix et al. [86] proposed AtlasNet, which generates 3D surfaces from multiple patches. AtlasNet learns to convert 2D square patches into 2-manifolds to cover the surface of 3D shapes using an MLP (multi-layer perceptron). Ben-Hamu et al. [87] proposed a multi-chart generative model for 3D shape generation. It uses a multi-chart structure as input; the network architecture is based on a standard image GAN [41]. The transformation between the 3D surface and the multi-chart structure is based on Ref. [68]. However, methods based on deforming a template mesh into the target shape cannot express the complex topology of some 3D shapes. Pan et al. [88] proposed a new single-view reconstruction method which combines a deformation network and a topology modification network to model meshes with complex topology. In the topology modification network, faces with high distortion are removed. Tang et al. [89] proposed generating complex-topology meshes using a skeleton-bridged learning method, as a skeleton can well preserve topology information. Instead of generating triangular meshes, Nash et al. [90] proposed PolyGen to generate a polygon mesh representation.
Inspired by neural autoregressive models in other fields like natural language processing, they regarded mesh generation as a sequential process, and designed a transformer-based network [91], including a vertex model and a face model. The vertex model generates a sequence of vertex positions and the face model generates variable-length vertex sequences conditioned on input vertices.

Implicit representations
In addition to explicit representations such as point clouds and meshes, implicit representations have increased in popularity in recent studies. A major reason is that implicit representations are not limited to fixed topology or resolution. An increasing number of deep models define their own implicit representations and build on them for various methods of shape analysis and generation.

Occupancy and indicator functions
Occupancy and indicator functions are one way to represent 3D shapes implicitly. An occupancy network was proposed by Mescheder et al. [8] to learn a continuous occupancy function as a new 3D shape representation for neural networks. The occupancy function reflects the status of a 3D point with respect to the 3D shape's surface: 1 means inside the surface and 0 otherwise. The authors regarded this problem as a binary classification task and designed an occupancy network which takes a 3D point position and an observation of the 3D shape as input, and outputs the probability of occupancy. The generated implicit field is then processed by a multi-resolution isosurface extraction method (MISE) and the marching cubes algorithm [10] to obtain a mesh. Moreover, the authors introduced encoder networks to obtain latent embeddings. Similarly, Chen et al. [26] designed IM-NET as a decoder for learning generative models; it also uses an implicit function in the form of an indicator function.
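The inside/outside decision such an occupancy function encodes can be written down directly for an analytic shape. The sketch below uses a toy sphere; `occupancy` and `classify` are hypothetical helpers, not the paper's network, which instead regresses the probability from a latent shape observation.

```python
import math

def occupancy(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Ground-truth occupancy of a sphere: 1 inside the surface, 0 otherwise."""
    return 1.0 if math.dist(p, center) < radius else 0.0

def classify(prob, threshold=0.5):
    """A learned occupancy network outputs a probability in [0, 1];
    thresholding recovers the binary inside/outside label."""
    return 1 if prob >= threshold else 0

print(occupancy((0.5, 0.0, 0.0)))  # inside -> 1.0
print(occupancy((2.0, 0.0, 0.0)))  # outside -> 0.0
```

Training pairs for the binary classification task are obtained exactly this way: sample points around the shape and record their ground-truth occupancy.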

Signed distance functions
Signed distance functions (SDFs) are another form of implicit representation. They map a 3D point to a real value instead of a probability; the value indicates the spatial relation to, and distance from, the 3D surface. Let SDF(x) be the signed distance value of a given 3D point x ∈ R^3. Then SDF(x) > 0 if point x is outside the 3D shape, SDF(x) < 0 if point x is inside the shape, and SDF(x) = 0 if point x is on the surface; the absolute value of SDF(x) gives the distance between point x and the surface. Park et al. [25] proposed DeepSDF, an auto-decoder-based network that learns a continuous SDF as a new 3D shape representation. Xu et al. [9] also proposed deep implicit surface networks (DISNs) for single-view 3D reconstruction based on SDFs. Thanks to the advantages of SDFs, DISN was the first method to reconstruct 3D shapes with flexible topology and thin structures in the single-view reconstruction task, which is difficult with other 3D representations.
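The sign convention above is easy to state for an analytic shape, here a sphere. This is an illustrative sketch, not DeepSDF's learned network, which regresses these values for arbitrary shapes from a latent code.

```python
import math

def sdf_sphere(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: > 0 outside, < 0 inside, 0 on the
    surface; |SDF(p)| is the distance from p to the surface."""
    return math.dist(p, center) - radius

# The surface is the zero level set SDF(x) = 0, which a method such as
# marching cubes can extract as a mesh.
```

Networks trained on such (point, signed distance) pairs inherit the property that any topology with a well-defined inside and outside can be represented at arbitrary resolution.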

Function sets
Occupancy functions and signed distance functions represent the 3D shape surface by a single function learned by a deep neural network. Genova et al. [92,93] proposed representing an entire 3D shape by combining a set of shape elements. In Ref. [92], they proposed structured implicit functions (SIFs): each element is represented by a scaled, axis-aligned, anisotropic 3D Gaussian, and the sum of these shape elements represents the whole 3D shape; the Gaussians' parameters are learned by a CNN. Ref. [93] improved SIF and proposed deep structured implicit functions (DSIFs), which add deep neural networks as deep implicit functions (DIFs) to provide local geometric detail. To summarize, DSIF exploits SIF to depict coarse information for each shape element, and applies DIF for local shape details.
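The idea of summing scaled, axis-aligned anisotropic Gaussians can be sketched as follows. The element parameters here are hypothetical and hand-picked; in SIF they are predicted by a CNN from the shape observation.

```python
import math

def gaussian_element(p, center, scale, radii):
    """One scaled, axis-aligned, anisotropic 3D Gaussian shape element."""
    e = sum(((pi - ci) ** 2) / (2.0 * ri ** 2)
            for pi, ci, ri in zip(p, center, radii))
    return scale * math.exp(-e)

def structured_implicit(p, elements):
    """SIF-style template: the sum of all shape elements at point p."""
    return sum(gaussian_element(p, c, s, r) for c, s, r in elements)

# Two hypothetical elements approximating a dumbbell-like shape;
# thresholding the summed field (e.g. at 0.5) yields the surface.
elements = [((-1.0, 0.0, 0.0), 1.0, (0.5, 0.5, 0.5)),
            (( 1.0, 0.0, 0.0), 1.0, (0.5, 0.5, 0.5))]
```

DSIF keeps this coarse template but attaches a small implicit network to each element to add local detail the Gaussians cannot express.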

Approach without 3D supervision
The implicit representation models above need to sample 3D points in a 3D shape's bounding box as ground truth, and train the model with 3D supervision. However, 3D ground truth may not be readily available. Liu et al. [30] proposed a framework which learns implicit representations without explicit 3D supervision. The model uses a field probing algorithm to bridge the gap between the 3D shape and 2D images, a silhouette loss to constrain the 3D shape's outline, and geometry regularization to keep the surface plausible.

Structure-based representations
Recently, more and more researchers have realized the importance of integrating structural information into deep learning models. Primitive representations are a typical kind of structure-based representation which explicitly depicts 3D shape structure: a 3D shape is represented by several primitives, such as oriented 3D boxes, each described by a compact parameter set. Rather than describing geometric details, the primitive representation concentrates on the overall structure of a 3D shape. More importantly, a primitive representation helps a method generate more detailed and plausible 3D shapes.

Linear organization
Observing that humans often regard 3D shapes as collections of parts, Zou et al. [11] proposed 3D-PRNN, which applies an LSTM in a primitive generator to generate primitives sequentially. The resulting primitive representations are highly efficient for depicting simple, regular 3D shapes. Wu et al. [94] further proposed an RNN-based method called PQ-NET which also treats 3D shape parts as a sequence; the difference is that PQ-NET encodes geometry features in the network. Gao et al. [27] proposed a deep generative model named SDM-NET (structured deformable mesh-net). They designed a two-level VAE containing a PartVAE for part geometry and an SP-VAE (structured parts VAE) for both structure and geometry features. Each shape part is encoded in a well-designed form which records both structure information (symmetry, supporting, and supported) and geometry features.

Hierarchical organization
Li et al. [12] proposed GRASS (generative recursive autoencoders for shape structures), one of the first attempts to encode 3D shape structure using a neural network. It describes shape structure as a hierarchical binary tree, in which child nodes are merged into their parent node by either adjacency or symmetry relations. Leaves in this structure tree represent oriented bounding boxes (OBBs) and geometry features for each part, while intermediate nodes represent both the geometric features of their child nodes and the relations between them. Inspired by recursive neural networks (RvNNs) [33,95], GRASS recursively merges the codes representing the OBBs into a root code which depicts the whole shape structure. The architecture of GRASS has three parts: an RvNN autoencoder for encoding a 3D shape into a fixed-length code, a GAN for learning the distribution of root codes and generating plausible structures, and another autoencoder (inspired by Ref. [45]) for synthesizing the geometry of each part. Furthermore, to synthesize fine-grained geometry in voxels, structure-aware recursive features (SARFs) are used, which contain both the geometric features of each part and the global and local OBB layout. However, GRASS [12] organizes the part structure as a binary tree, which leads to ambiguity and makes it unsuitable for large-scale datasets. To address this problem, Mo et al. [28] proposed StructureNet, which organizes the hierarchical structure as a graph.
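The recursive merging at the heart of an RvNN encoder can be sketched with an untrained placeholder: element-wise averaging stands in for the learned two-to-one merge, and the nested-tuple tree and part codes below are hypothetical.

```python
# Hypothetical recursive encoder in the spirit of GRASS: leaf codes
# (one per part OBB) are merged pairwise up a binary tree into a root
# code of the same fixed length.

def merge(left, right):
    """Placeholder for a learned two-to-one RvNN merge operation."""
    return [0.5 * (a + b) for a, b in zip(left, right)]

def encode(tree):
    """Recursively encode a nested tuple of leaf codes into a root code."""
    if isinstance(tree, tuple):          # internal node: (left, right)
        return merge(encode(tree[0]), encode(tree[1]))
    return tree                          # leaf: a part's fixed-length code

# A chair as ((back, seat), (left_legs, right_legs)), codes of length 2:
root = encode((([1.0, 0.0], [0.0, 1.0]), ([1.0, 1.0], [0.0, 0.0])))
```

Decoding runs the same recursion in reverse, splitting the root code back into part codes, which is what lets GRASS generate a structure and then synthesize per-part geometry.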
BSP-Net (binary space partitioning network), proposed by Chen et al. [31], was the first method to depict sharp geometric features. It constructs a 3D shape from convex components organized in a BSP-tree [31]. BSP-Net comprises three layers, for hyperplane extraction, hyperplane grouping, and shape assembly. The convex components can also be seen as a new form of primitive, one which can represent the geometric details of 3D shapes rather than only their general structure.

Structure and geometry
Researchers have tried to encode 3D shape structure and geometric features separately [12] or jointly [96]. Wang et al. [97] proposed a global-to-local (G2L) generative model to generate man-made 3D shapes from coarse to fine. To address the problem that GANs cannot generate geometric details well [19], G2L first applies a GAN to generate a coarse voxel grid with semantic labels representing shape structure at the global level, then passes the voxels, separated by semantic label, to an autoencoder called the part refiner (PR) to optimize geometric details part by part at the local level. Wu et al. [96] proposed SAGNet for detailed 3D shape generation; it encodes structure and geometry jointly using a GRU [47] architecture in order to find relationships between them. SAGNet models tenon-mortise joints better than other structure-based learning methods.

Deformation-based representations
Deformable 3D models play an important role in computer animation. However, most methods mentioned above focus on rigid 3D models and pay less attention to the deformation of non-rigid models. Unlike other representations, deformation-based representations parameterize the deformation information and achieve better performance for non-rigid 3D shapes such as articulated models.

Mesh-based approaches
A mesh can be seen as a graph, which makes it convenient to manipulate vertex positions while maintaining the connectivity between vertices. Therefore, a great number of methods choose meshes to represent deformable 3D shapes. Based on this property, some mesh-based generation methods produce target shapes by deforming a mesh template [23,27,85,88], and can thus also be regarded as deformation-based methods. The graph structure makes it easy to store deformation information as vertex features, which can be seen as a deformation representation.
Gao et al. [17] designed an efficient, rotation-invariant deformation representation called rotation-invariant mesh difference (RIMD), which achieves high performance for shape reconstruction, deformation, and registration. Based on Ref. [17], Tan et al. [98] proposed Mesh VAE for deformable shape analysis and synthesis; it takes RIMD features as input to a VAE and uses fully connected layers in the encoder and decoder. Further, Gao et al. [99] designed an as-consistent-as-possible (ACAP) representation to constrain the rotation angles and rotation axes between adjacent vertices of a deformable mesh, to which graph convolution is easily applied. Tan et al. [100] proposed SparseAE, based on the ACAP representation [99]; it applies graph convolutional operators [101] to ACAP features [99] to analyse mesh deformations. Gao et al. [102] proposed VC-GAN (VAE CycleGAN), the first automatic approach for unpaired mesh deformation transfer. It takes the ACAP representation as input, encodes it into a latent space with a VAE, and transfers deformation between source and target in the latent space under cycle consistency and visual similarity consistency. Gao et al. [27] first viewed the geometric details shown in Fig. 5 as deformations; based on previous techniques [98-100, 102], geometric details can be encoded and generated. The structure in Ref. [27] is further analyzed in Ref. [103] to determine stable support. Yuan et al. [104] applied a newly designed pooling operation, based on mesh simplification and graph convolution, in a VAE architecture which also takes the ACAP representation as network input. Tan et al. [105] used the ACAP representation to simulate thin-shell deformable materials, applying a graph-based CNN to embed high-dimensional features into low-dimensional ones. Beyond single deformable meshes, mesh sequences also play an important role in computer animation.
The deformation-based representation ACAP [99] is suitable for representing a mesh sequence. The deformation-based representation and related works are illustrated in Fig. 4.

Implicit surface-based approaches
With the development of implicit surface representations, Jeruzalski et al. [32] proposed neural articulated shape approximation (NASA), a method to represent articulated deformable shapes by pose parameters, which record the transformations of the bones defined in a model. They compared three network architectures, an unstructured model (U), a piecewise rigid model (R), and a piecewise deformable model (D), on training and test datasets. This work opens another direction for representing deformable 3D shapes.

Datasets
With the development of 3D scanners, 3D models have become easier to obtain, and more and more 3D shape datasets covering different 3D representations have been released. Larger, more detailed datasets pose greater challenges for existing techniques, further promoting the development of deep learning on different 3D representations.
The datasets can be divided into several types according to representation and application. Choosing the appropriate type benefits the performance and generalization of learning-based models.

RGB-D images
RGB-D image datasets can be collected with depth sensors like the Microsoft Kinect, and most can be regarded as video sequences. The NYU Depth [106,107] indoor-scene RGB-D dataset was first provided as a benchmark for the segmentation problem; version 1 [106] has 64 scenes while version 2 [107] has 464 scenes. The KITTI [108] dataset provides outdoor scene images aimed mainly at autonomous driving, in 5 categories: road, city, residential, campus, and person. Depth maps for the images can be computed using the development kit provided with the dataset, which also contains 3D object annotations for applications such as object detection. ScanNet [109] is a large annotated RGB-D video dataset which includes 2.5M views, with 3D camera poses, of 1513 scenes, together with surface reconstructions and semantic segmentations.

Man-made 3D objects
ModelNet [13] is a well-known CAD model dataset for 3D shape analysis, including 127,915 3D CAD models in 662 categories. Two subsets, ModelNet10 and ModelNet40, include 10 and 40 categories respectively; in each subset, the 3D models are manually aligned. ShapeNet [111] is larger, containing more than 3 million models in more than 4000 categories, and has two smaller subsets, ShapeNetCore and ShapeNetSem. ShapeNet [111] provides rich annotations for its 3D objects, including category labels, part labels, symmetry information, etc. ObjectNet3D [112] is a large-scale dataset for 3D object recognition from 2D images. It includes 201,888 3D objects in 90,127 images and 44,147 different 3D shapes, annotated with 3D pose parameters which align the 3D objects with the 2D images. SUNCG [113] includes full 3D models of rooms and is suitable for 3D scene analysis and scene completion tasks; its 3D models are represented by dense voxel grids with object annotations. The whole dataset includes 49,884 valid floors with 404,058 rooms and 5,697,217 object instances. PartNet [114] provides a more detailed CAD model dataset with fine-grained, hierarchical part annotations, bringing more challenges and resources for 3D object applications such as semantic segmentation, shape editing, and shape generation. 3D-FUTURE [115] provides a large-scale furniture dataset, including over 20,000 scenes in over 5000 rooms with over 10,000 3D instances. Each 3D shape is of high quality, and this dataset currently offers the richest texture information.

Non-rigid models
TOSCA [116] is a high-resolution non-rigid 3D model dataset containing 80 objects in 9 categories, in mesh representation; objects in the same category have the same connectivity. FAUST [117] is a dataset of 3D human body scans of 10 different people in a variety of poses, with ground-truth correspondences. Because FAUST was designed for real-world shape registration, the scans are noisy and incomplete, but the corresponding ground truth is watertight and aligned. AMASS [118] provides a large and varied human motion dataset, gathering previous mocap datasets within a consistent framework and parameterization; it contains 344 subjects, 11,265 motions, and more than 40 hours of recordings.

Shape analysis and reconstruction
The shape representations discussed above are fundamental for shape analysis and shape reconstruction. In this section, we summarize representative works in these two directions respectively and compare their performance.

Shape analysis
Shape analysis methods usually extract latent codes from different 3D shape representations using different network architectures. The latent codes are then used for specific applications like shape classification, shape retrieval, shape segmentation, etc. Different representations are usually suited to different applications.
We now review the performance of different representations in different models and discuss suitable representations for specific applications.

Shape classification and retrieval
Shape classification and retrieval are basic problems in shape analysis; both rely on feature vectors extracted by the analysis networks. For shape classification, ModelNet10 and ModelNet40 [13] are widely used as benchmarks, and Table 2 shows the accuracy of various methods on them. For shape retrieval, given a 3D shape as a query, the target is to find the most similar shape(s) in the dataset. Retrieval methods usually learn a compact code representing the object in a latent space, and seek the closest object under Euclidean distance, Mahalanobis distance, or some other distance metric. Unlike classification, shape retrieval has a number of evaluation measures, including precision, recall, mAP (mean average precision), etc.
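Once codes are extracted, retrieval reduces to nearest-neighbour search in the latent space. A minimal Euclidean-distance version can be sketched as follows (`retrieve` is a hypothetical helper; the two-dimensional codes are illustrative only):

```python
import math

def retrieve(query, gallery, k=1):
    """Return the indices of the k gallery codes closest to the query
    under Euclidean distance in the learned latent space."""
    ranked = sorted(range(len(gallery)),
                    key=lambda i: math.dist(query, gallery[i]))
    return ranked[:k]

# Hypothetical latent codes for three shapes in the gallery.
codes = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.1]]
top2 = retrieve([1.0, 1.0], codes, k=2)
```

Swapping the distance function (e.g. for a learned Mahalanobis metric) changes the ranking but not the overall procedure; evaluation then computes precision, recall, or mAP over the ranked lists.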

Shape segmentation
Shape segmentation aims to discriminate the parts of a 3D shape, and plays an important role in understanding 3D shapes. Mean intersection-over-union (mIoU) is often used as the evaluation metric. Most researchers choose a point-based representation for the segmentation task [6,7,24,58,61].
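For per-point part labels, mIoU can be computed per shape as follows. This is a straightforward sketch; benchmarks differ in details such as how parts absent from both prediction and ground truth are scored (here they count as IoU 1).

```python
def miou(pred, truth, num_parts):
    """Mean intersection-over-union across part labels.
    pred, truth: per-point part labels of one shape."""
    ious = []
    for part in range(num_parts):
        inter = sum(p == part and t == part for p, t in zip(pred, truth))
        union = sum(p == part or t == part for p, t in zip(pred, truth))
        ious.append(inter / union if union else 1.0)  # empty part: IoU = 1
    return sum(ious) / num_parts

# Four points, two parts: one point of part 0 mislabelled as part 1.
score = miou([0, 0, 1, 1], [0, 1, 1, 1], num_parts=2)
```

Dataset-level results average these per-shape scores, either over all shapes or per category.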

Symmetry detection
Symmetry is important in 3D shapes, and can be further used in many other applications such as shape alignment, registration, completion, etc. Gao et al. [120] designed the first unsupervised deep learning method, PRS-Net (planar reflective symmetry net), to detect planar reflective symmetry in 3D shapes, using a new symmetry distance loss and a regularization loss, as illustrated in Fig. 6. It proved robust in the presence of noisy and incomplete input, and more efficient than traditional methods. As symmetry is largely determined by overall shape, PRS-Net is based on a 3D voxel CNN and has high performance at low resolution.
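The intuition behind a symmetry distance loss can be sketched in a hedged way (this is the general idea, not PRS-Net's implementation): reflect sample points through a candidate plane and measure how far the reflected set lies from the original shape.

```python
import math

def reflect(p, n, d):
    """Reflect point p through the plane n·x = d (n is unit length)."""
    t = sum(pi * ni for pi, ni in zip(p, n)) - d
    return tuple(pi - 2.0 * t * ni for pi, ni in zip(p, n))

def symmetry_distance(points, n, d):
    """Sum of distances from each reflected point to its nearest original
    point; near zero when the plane is a true reflective symmetry plane."""
    total = 0.0
    for p in points:
        r = reflect(p, n, d)
        total += min(math.dist(r, q) for q in points)
    return total

# A toy point set symmetric about the plane x = 0.
pts = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
```

Minimizing such a loss over predicted plane parameters is what allows training without symmetry annotations, since a correct plane drives the loss toward zero on its own.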

Shape reconstruction
Learning-based generative models have been proposed for different representations; shape reconstruction is also an important field in geometry learning. Reconstruction applications include single-view shape reconstruction, shape generation, shape editing, etc. Generation methods can be summarized by representation. For voxel-based representations, learning-based models try to predict the occupancy probability of each voxel. For point-based representations, they either sample 3D points in space or fold a 2D grid onto the target 3D object. For mesh-based representations, most generation methods deform a mesh template into the final mesh. Recent studies show that more and more methods use a structured representation and generate 3D shapes in a coarse-to-fine way.

Summary
This survey has reviewed deep learning methods based on different 3D object representations. We first overviewed different 3D representation learning models; the tendency in geometry learning is to reduce computation and memory demands while increasing detail and structure. We then introduced 3D datasets widely used in research, which provide rich resources and support the evaluation of data-driven learning methods. Finally, we discussed 3D shape applications based on different 3D representations, including shape analysis and shape reconstruction. Different representations suit different applications; it is important to choose suitable 3D representations for specific tasks.

Fig. 6 Pipeline of PRS-Net. Reproduced with permission from Ref. [120].

Lin Gao received his bachelor degree in mathematics from Sichuan University and his Ph.D. degree in computer science from Tsinghua University. He is currently an associate professor at the Institute of Computing Technology, Chinese Academy of Sciences. His research interests include computer graphics and geometric processing. He received a Newton Advanced Fellowship award from the Royal Society in 2019.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.