1 Introduction

In recent years, the demand for three-dimensional (3D) content has reached an unprecedented level as virtual reality (VR) and augmented reality (AR) applications have become increasingly influential. Fancy terms such as Metaverse and digital human are being created and used in different contexts. However, it is a challenge to acquire the vast amount of 3D content that is needed to build these applications. Traditional approaches to creating 3D shapes rely heavily on trained artists and are struggling to keep up with the growing demand. To solve this problem, various methods have been proposed to generate 3D shapes, making 3D content generation an active area of computer graphics and computer vision.

However, the inherent complexity and variety of 3D data make 3D content generation a difficult task. Unlike two-dimensional (2D) data, which can be effectively represented by an array, 3D content has no single canonical representation, and a number of representations have been proposed for 3D content generation. These representations include meshes, voxels, point clouds, structures (or primitives), deformation-based representations, multi-view images, and implicit representations [1, 2]. Many methods and architectures for 3D content generation have been built on top of these representations. Traditionally, researchers have focused on explicit representations such as meshes, voxels, and point clouds [3–5] because they are easy to render and edit. With the rapid development of deep learning and neural networks, function-based implicit representations have become popular [6–9], since a neural network can naturally serve as an implicit function. Methods enhanced by deep learning have been observed to outperform traditional ones. However, these methods omit the appearance of 3D shapes, and they often need abundant ground truth 3D data. Led by the pioneering work [10], neural radiance fields (NeRFs) are rapidly gaining attention for their ability to learn and generate appearance along with geometry from just a few multi-view images [11, 12]. Furthermore, EG3D [13] shows the possibility of compressing the 3D representation of NeRF into three feature planes (triplanes). More recently, Dreamfusion [14] and a series of follow-up works have taken advantage of the power of 2D diffusion models [15] to generate NeRFs from multi-modal conditions. These studies have contributed to the increasing popularity of 3D shape generation using implicit representations. Existing surveys [1, 2] usually cover implicit shape generation together with other types of representations such as meshes and point clouds, and they are generally based on work published before 2022. However, recent developments in the above-mentioned methods for generating 3D content have led to numerous studies achieving high-quality generation results. This wealth of work can be confusing for researchers attempting to enter the field. Therefore, a comprehensive survey of recent work is needed.

In this survey, we focus on recently proposed implicit representation-based 3D shape generation methods. We categorize the implicit representations actively used in the literature into three types: signed distance fields, radiance fields, and triplanes. In Sect. 2, we first introduce these popular representations and then describe the architectures used to generate geometry from them. In Sect. 3, we list and analyze works according to these representations and architectures. In Sect. 4, we present open problems and future research directions, and finally draw conclusions.

2 Background

In this section, we briefly introduce the preliminary knowledge of implicit 3D shape generation. Section 2.1 describes three different implicit representations, the signed distance field, radiance field, and triplane, and Sect. 2.2 covers commonly used deep learning architectures for generating 3D data.

2.1 Implicit representation of 3D shapes

2.1.1 Signed distance fields (SDFs)

Signed distance fields (SDFs) are essentially functions defined in 3D space: \(f(\boldsymbol{x}): \mathbb{R}^{3} \rightarrow \mathbb{R}\). The level set of this function \(\mathcal{S} = \{\boldsymbol{s} | \boldsymbol{s} \in \mathbb{R}^{3}, f(\boldsymbol{s}) = 0\}\) is the surface of the underlying 3D geometry, and \(|f(\boldsymbol{x})|\) for any other point \(\boldsymbol{x}\) represents the minimum distance from \(\boldsymbol{x}\) to \(\mathcal{S}\). The sign of \(f(\boldsymbol{x})\) is positive if \(\boldsymbol{x}\) lies inside \(\mathcal{S}\) and negative otherwise. As a function, the SDF is more flexible than common explicit representations such as point clouds or meshes, and inherently allows topology manipulations such as constructive solid geometry (CSG) operations. Moreover, SDFs allow the use of a technique known as “sphere tracing” [16], which accelerates ray marching during rendering. Owing to these properties, SDFs are popular in several areas of the computer graphics literature. An SDF can be easily converted to a mesh using algorithms such as Marching Cubes [17], and this conversion can be made differentiable by means of deep marching tetrahedra (DMTet) [18]. DMTet thus serves as a bridge between SDFs and meshes, enabling much previous work built on meshes, such as neural rendering, to be transferred to SDFs. We include this type of work in signed distance field-based 3D content generation, since the underlying representation is the SDF.
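
To make the sphere tracing idea concrete, the following NumPy sketch (with illustrative helper names of our own, following the positive-inside sign convention stated above) marches a ray forward by the unsigned distance at each step until it reaches the zero level set.

```python
import numpy as np

def sdf_sphere(p, center=np.zeros(3), radius=1.0):
    # Analytic SDF of a unit sphere, positive inside and negative outside,
    # matching the sign convention used in this survey.
    return radius - np.linalg.norm(p - center)

def sphere_trace(origin, direction, sdf, max_steps=128, eps=1e-4, t_max=100.0):
    # March along the ray; |f(x)| is a safe step size because no part of the
    # surface can be closer than the distance to the level set.
    t = 0.0
    for _ in range(max_steps):
        p = origin + t * direction
        d = sdf(p)
        if abs(d) < eps:
            return t, p          # hit: close enough to the surface f = 0
        t += abs(d)
        if t > t_max:
            break
    return None, None            # miss

# Usage: a ray from z = -3 toward the origin should hit the unit sphere near z = -1.
t_hit, p_hit = sphere_trace(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]), sdf_sphere)
```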

2.1.2 Radiance fields (RFs)

Radiance fields (RFs) are a pair of functions: \(c(\boldsymbol{x}): \mathbb{R}^{3} \rightarrow [0, 1]^{3}\) and \(d(\boldsymbol{x}): \mathbb{R}^{3} \rightarrow [0, +\infty ]\). c and d are the radiance color and density at a point, respectively. They can be rendered by volume rendering and ray marching algorithms [19]. RFs are often accompanied by a positional encoding \(e(\boldsymbol{x}) = (\sin (2^{0}\uppi \boldsymbol{x}), \cos (2^{0}\uppi \boldsymbol{x}), \ldots , \sin (2^{L-1}\uppi \boldsymbol{x}), \cos (2^{L-1}\uppi \boldsymbol{x}))\), where L is a hyperparameter controlling the dimension of the encoding. Positional encoding is the key for RFs to reconstruct high-frequency features of the geometry. The use of the radiance field as a representation was introduced in Ref. [10]. Although recently proposed, it has gained surprising popularity due to its ability to accurately reconstruct 3D geometry from only a few sparse multi-view images. However, the density field used by vanilla RFs struggles to define a clear surface for the geometry, limiting the fidelity of RFs as a representation. To solve this problem, VolSDF [20] and NeuS [21] use SDFs as the geometry and propose algorithms to convert SDF values into volume rendering weights. Although they use SDFs, we regard works built upon them as radiance field-based works because they preserve the volume rendering and positional encoding techniques of RFs.
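
As a minimal illustration of the two ingredients above, the NumPy sketch below (our own illustrative functions, not code from Ref. [10]) computes the positional encoding e(x) and composites per-sample colors and densities along a single ray using the standard volume rendering weights.

```python
import numpy as np

def positional_encoding(x, L=10):
    # e(x) = (sin(2^0 * pi * x), cos(2^0 * pi * x), ..., sin(2^{L-1} * pi * x), cos(2^{L-1} * pi * x))
    freqs = (2.0 ** np.arange(L)) * np.pi                 # (L,)
    angles = x[..., None, :] * freqs[:, None]             # (..., L, D)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)                 # (..., 2 * L * D)

def composite_ray(colors, densities, deltas):
    # colors: (N, 3), densities: (N,), deltas: (N,) distances between consecutive samples.
    alphas = 1.0 - np.exp(-densities * deltas)            # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1] + 1e-10)))  # transmittance
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0), weights

# Usage with random per-sample values along one ray of 64 samples.
pixel, w = composite_ray(np.random.rand(64, 3), np.random.rand(64), np.full(64, 0.05))
```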

2.1.3 Triplanes

Triplanes are three 2D feature planes, each represented by an \(N\times N \times C\) feature map, where N is the resolution of the feature planes and C is the number of channels. The three planes are denoted \(F_{xy}\), \(F_{xz}\), and \(F_{yz}\), since they are placed perpendicular to each other in 3D space, aligned with the xy, xz, and yz planes. Rendering triplane-based geometry also uses ray marching. In contrast to RFs, the sample points are directly projected onto \(F_{xy}\), \(F_{xz}\), and \(F_{yz}\) to sample features via bilinear interpolation. The three sampled features are then concatenated and fed to a small multi-layer perceptron (MLP). Following the convention of RFs, the output of the MLP is usually a density or SDF value together with a color value. The triplane was introduced by EG3D [13] in 2022, and its initial purpose was shape generation. Given that triplanes and RFs often appear together, one might be tempted to treat the triplane as just another “trick” for RFs, like positional encoding, but we list triplanes as an independent representation because: 1) Triplanes can be directly generated from random noise or latent vectors by methods like StyleGAN [22]. 2) Triplanes can be transferred to other function-based implicit representations, such as occupancy fields, by modifying the head of the MLP. 3) A great deal of recent work is built on triplanes, so it makes sense to create a category for them when reviewing. Note that SDF-based generation methods are only capable of generating shapes, while RF- or triplane-based methods can generate both shape and appearance simultaneously, since they combine color with geometry in an implicit neural field.
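
The query procedure described above can be sketched in a few lines of PyTorch. The code below is a hypothetical illustration (function and variable names are our own), assuming sample points lie in \([-1, 1]^{3}\) as required by the bilinear sampling convention.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, pts):
    """Query triplane features for 3D points.
    planes: (3, C, N, N) feature maps for the xy, xz, and yz planes.
    pts:    (P, 3) points assumed to lie in [-1, 1]^3.
    Returns (P, 3 * C) concatenated bilinear samples, ready for a small MLP head.
    """
    coords = torch.stack([pts[:, [0, 1]],      # project onto the xy plane
                          pts[:, [0, 2]],      # project onto the xz plane
                          pts[:, [1, 2]]])     # project onto the yz plane
    grid = coords.unsqueeze(2)                 # (3, P, 1, 2) sampling grid
    feats = F.grid_sample(planes, grid, mode='bilinear', align_corners=True)  # (3, C, P, 1)
    return feats.squeeze(-1).permute(2, 0, 1).reshape(pts.shape[0], -1)

# Usage sketch: random planes and points, followed by a toy MLP head.
planes = torch.randn(3, 32, 256, 256)
pts = torch.rand(1024, 3) * 2 - 1
features = sample_triplane(planes, pts)        # (1024, 96)
head = torch.nn.Sequential(torch.nn.Linear(96, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))
sigma_and_rgb = head(features)                 # density + color, following the RF convention
```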

2.2 Architectures for 3D shape generation

2.2.1 Generative Adversarial Networks (GANs)

Generative adversarial networks (GANs) [23] consist of a pair of neural networks: a generator and a discriminator. The generator produces a data sample from random noise or a condition, while the discriminator takes the sample and tries to distinguish it from data drawn from the real distribution. The generator and the discriminator are trained simultaneously against each other, which is why they are called adversarial networks. GANs are flexible and can be used to generate both 2D and 3D data under various conditions.
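
The adversarial training loop can be summarized by the following generic PyTorch sketch (not tied to any particular 3D method; `G`, `D`, and the two optimizers are assumed to be defined elsewhere).

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z_dim=128):
    """One non-saturating GAN update; D outputs a single logit per sample."""
    z = torch.randn(real.size(0), z_dim)
    fake = G(z)

    # Discriminator: push real samples toward 1 and generated samples toward 0.
    d_loss = F.binary_cross_entropy_with_logits(D(real), torch.ones(real.size(0), 1)) \
           + F.binary_cross_entropy_with_logits(D(fake.detach()), torch.zeros(real.size(0), 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator: fool the discriminator into predicting 1 for generated samples.
    g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones(real.size(0), 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```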

2.2.2 Variational autoencoders (VAEs)

Variational autoencoders (VAEs) [24] are encoder-decoder structures in which the encoder compresses a data sample into a latent vector z and the decoder maps it back to the sample. A useful property of VAEs is the latent space that contains these vectors: once a VAE is trained, the latent space is naturally obtained and can be used as an embedding of the original data distribution, supporting operations such as interpolation. In this sense, VAEs are more controllable than GANs.
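
A minimal sketch of the VAE objective follows, assuming a hypothetical `encoder` that returns the mean and log-variance of the posterior and a `decoder` that maps latent vectors back to data.

```python
import torch
import torch.nn.functional as F

def vae_loss(encoder, decoder, x, beta=1.0):
    """VAE training objective: reconstruction term + KL to a standard normal prior."""
    mu, logvar = encoder(x)                                   # parameters of q(z|x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
    recon = decoder(z)
    rec_loss = F.mse_loss(recon, x, reduction='mean')
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec_loss + beta * kl
```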

2.2.3 Diffusion models (DMs)

Diffusion models (DMs) [25] generate data by starting from noise of the same dimension as the data and iteratively “denoising” it with the same network. During training, Gaussian noise is added to real data, and the network is supervised to recover the data from the noisy version by predicting the added noise. Since running the original model directly in data space is slow, the latent diffusion model [15] runs the DM in a latent space and uses an encoder-decoder structure to link the latent space to the data space.
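
The training objective can be sketched as follows. This is a generic DDPM-style step with illustrative names (not the exact formulation of Refs. [15, 25]), assuming a noise-prediction network called as `model(x_t, t)` and a precomputed noise schedule `alphas_cumprod`.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alphas_cumprod):
    """Denoising objective: predict the Gaussian noise added at a random timestep."""
    t = torch.randint(0, alphas_cumprod.shape[0], (x0.size(0),))
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over data dims
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise          # forward (noising) process
    return F.mse_loss(model(x_t, t), noise)                       # supervise the predicted noise
```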

2.2.4 2D-to-3D models

2D-to-3D models are a special kind of model introduced by Dreamfusion [14]. They take advantage of large pre-trained latent diffusion models such as Stable Diffusion [15], which are publicly available, and of RFs, which can reconstruct 3D shapes from a few 2D images. These models typically use a loss such as the score distillation sampling (SDS) loss [14] to obtain a gradient from the frozen latent diffusion model (LDM) and use it to update the weights of the NeRF.
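
The core SDS update can be sketched schematically as follows. This is a simplified illustration of the idea in Ref. [14] with hypothetical names (`render_fn`, `diffusion`, `text_emb`), not the authors' implementation; classifier-free guidance and camera sampling are omitted.

```python
import torch

def sds_step(render_fn, diffusion, text_emb, alphas_cumprod, optimizer, w=1.0):
    # Render a random view of the current NeRF differentiably (gradients flow to NeRF weights).
    img = render_fn()                                     # (1, C, H, W)
    t = torch.randint(20, alphas_cumprod.shape[0] - 20, (1,))
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(img)
    noisy = a_bar.sqrt() * img + (1 - a_bar).sqrt() * noise
    with torch.no_grad():                                 # the 2D diffusion model stays frozen
        eps_pred = diffusion(noisy, t, text_emb)          # predicted noise, conditioned on the text
    grad = w * (eps_pred - noise)                         # SDS: the U-Net Jacobian is dropped
    optimizer.zero_grad()
    img.backward(gradient=grad)                           # pushes the gradient into the NeRF weights
    optimizer.step()
```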

3 Generation of implicit shapes

In this section, we review in detail recent work on implicit representation-based 3D shape generation. We categorize the works according to the type of representation they use, including signed distance fields (Sect. 3.1), radiance fields (Sect. 3.2), and triplanes (Sect. 3.3). A brief timeline of the generation of implicit shapes is shown in Fig. 1.

Figure 1 A brief timeline of implicit representation-based 3D shape generation

3.1 Signed distance field-based shape generation

Signed distance fields implicitly represent shapes by predicting a distance value for each sample point in 3D space, whose sign indicates whether the point lies inside or outside the surface. These distance values can also be fed to the Marching Cubes algorithm [17] to extract an explicit triangle mesh. In Table 1, we briefly categorize SDF-based shape generation methods according to the generation architecture and the type of generated results.
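
As an example of the extraction step, the short sketch below samples an analytic SDF on a regular grid and extracts a triangle mesh with the Marching Cubes implementation in scikit-image (an assumed dependency; the names are ours).

```python
import numpy as np
from skimage import measure

res = 64
xs = np.linspace(-1.2, 1.2, res)
grid = np.stack(np.meshgrid(xs, xs, xs, indexing='ij'), axis=-1)   # (res, res, res, 3)
sdf = 1.0 - np.linalg.norm(grid, axis=-1)                          # unit sphere, positive inside

# Extract the zero level set; in a generative pipeline, `sdf` would instead be
# predicted by the learned network at the grid points.
verts, faces, normals, _ = measure.marching_cubes(sdf, level=0.0,
                                                  spacing=(xs[1] - xs[0],) * 3)
```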

Table 1 Overview of works based on SDFs according to the generator and the generated results. GAN stands for generative adversarial network, VAE for variational autoencoders, and DM for diffusion models

3.1.1 GANs and VAEs

With the emergence of implicit representations, several pioneering works started to represent and generate 3D shapes in this form. Two concurrent works [8, 9] model the 3D space with occupancy functions, where 1 represents inside and 0 represents outside. Later, a convolution-based occupancy network [37] predicts learnable features defined in a volume or on multiple planes from an input point cloud, and the occupancy of a sample point is determined from the interpolated features by a decoder network. Instead of modeling the whole shape at once, BAE-Net [38] and BSP-Net [26] learn to segment and reconstruct shape parts in an unsupervised manner based on IM-Net [9]. RIM-Net [39] also decomposes shapes into multiple parts but further predicts a hierarchical structure. With the availability of fine-grained segmentation datasets [40, 41], researchers [42] have started to model shape geometry at the part level to capture more detail: each part is represented by a latent vector and an occupancy decoder, and when generating shapes, the parts are generated sequentially by a recurrent network. Compared with binary occupancy, SDF is continuous-valued and can represent more geometric detail. DeepSDF [6] was the first to use SDF to represent a 3D shape; it uses an auto-decoder architecture to jointly optimize the latent vectors and the decoder network. To improve reconstruction and generation quality, PIFu [34] and PIFuHD [35] extract pixel-aligned features from human body images as additional inputs to the decoder network. DISN [7] shares a similar idea but focuses on general 3D objects. D2IM-Net [43] reconstructs and generates more geometric details by decomposing the signed distance field into a base signed distance field and a displacement component. SDF-StyleGAN [44] extends the 2D StyleGAN [22] to 3D, deploying both global and local discriminators to ensure generation quality. To constrain the generated shapes, template-based methods [45, 46] model shapes in a specific category with a template signed distance field and a displacement field. With discrete encoding [47] becoming popular in data compression, ShapeFormer [48] learns to generate 3D shapes from incomplete point clouds by first encoding the incomplete data as a partial sequence of discrete indices using VQ-VAE [47] and then using a transformer network [49] to fill in the missing indices. AutoSDF [50] shares a similar idea but splits the whole 3D space into multiple 3D grids and encodes these grids as indices of the VQ-VAE codebook [47]. A transformer network then takes the index sequence as input and generates it auto-regressively. AutoSDF [50] allows not only the completion of shapes from incomplete data but also the generation of random shapes. Apart from geometry generation, TextureFields [51] was the first to explore texture generation based on image and shape conditions, but it can only synthesize rendered results of a given shape because both geometry and texture are implicit. To obtain explicit geometry and texture, DVR [52] represents geometry and texture with a single latent vector and uses an occupancy function and a texture function to predict the occupancy value and texture color of a sample point. AUV-Net [53] goes a step further and learns an aligned UV parameterization network for shapes in the same category, allowing seamless texture generation and transfer.

3.1.2 Diffusion

With recent advances in generative modeling with diffusion models, LION [27] applies the diffusion model to the 3D domain but takes point clouds as its 3D representation. MeshDiffusion [28] was the first to extend the diffusion model to an implicit representation; it represents 3D shapes with deep marching tetrahedra [18]. Li et al. [30] proposed encoding local patches of 3D shapes into voxel grids and training the diffusion model on a 3D grid. SDF-Diffusion [29] reduces the learning difficulty by first training a diffusion model on a low-resolution grid and then performing patch-based super-resolution to introduce additional geometric details. NeuralWaveletDiffusion [31] and NeuralWaveletDiffusion++ [32] instead convert the 3D signed distance volume into coefficient volumes using multi-scale wavelet decomposition: the diffusion model first generates a coarse coefficient volume, and a detail predictor then models the geometric details. LAS-Diffusion [33] also follows a coarse-to-fine generation paradigm: it first trains an occupancy diffusion network to generate a sparse voxel grid, subdivides it into a higher-resolution voxel grid, and then optimizes an SDF diffusion network to generate local details. Diffusion-SDF [54] first encodes shapes into triplane features and further compresses them into a compact latent vector; the diffusion model learns to generate 3D shapes by generating this latent vector. As regular grids and global latent codes have proven less expressive for geometric modeling, 3DShape2VecSet [55] proposes representing a 3D shape with a set of latent vectors distributed irregularly in 3D space. The feature of a sample point, used to predict its occupancy value, is determined by querying the set of latent vectors with a cross-attention layer. The diffusion model takes multiple sets of latent vectors as training data and generates a plausible set of latent vectors, which is decoded into a 3D shape. HyperDiffusion [56] overfits an occupancy network to each 3D shape in the training set and represents each shape with the optimized network parameters. These parameters are then taken as training data for the diffusion model, which means that the diffusion model operates in a “hyper” space.

3.1.3 Summary

SDF is a fundamental implicit representation of 3D shapes, and many works adopt it for geometry reconstruction and generation tasks. One advantage of SDF is that it provides clear inside-outside information that can be used to extract an explicit surface, a property that integrates well into geometry reconstruction pipelines. However, the representation also has limitations; for example, it is difficult to represent thin structures or open surfaces with an SDF. Hence, introducing a more flexible and general representation, such as the unsigned distance field [57], is a possible direction.

3.2 Radiance field-based shape generation

Radiance fields model the appearance and geometry of 3D shapes by combining volume rendering (ray marching) algorithms with positional encoding [10]. This technique can efficiently reconstruct geometry given only a few posed multi-view images. To achieve higher geometric quality, NeuS [21] and VolSDF [20] extend radiance fields by using an SDF instead of density as the geometry. In Table 2, we briefly categorize NeRF-based shape generation methods according to the generation architecture and the type of generated results.
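
To give a sense of how an SDF can drive volume rendering, the sketch below shows one common choice following VolSDF [20]: the signed distance is mapped to density through a scaled Laplace CDF (NeuS [21] uses a different, logistic-based weighting). We use the positive-inside sign convention adopted in Sect. 2.1.1, and the parameter names are illustrative.

```python
import numpy as np

def laplace_cdf(s, beta):
    # CDF of a zero-mean Laplace distribution with scale beta.
    return np.where(s <= 0, 0.5 * np.exp(s / beta), 1.0 - 0.5 * np.exp(-s / beta))

def sdf_to_density(sdf_values, alpha=100.0, beta=0.01):
    # Density rises smoothly from ~0 outside the surface to ~alpha inside;
    # beta controls how sharply it concentrates around the zero level set.
    return alpha * laplace_cdf(np.asarray(sdf_values), beta)

# The resulting densities can be fed to the standard volume rendering weights (Sect. 2.1.2).
density = sdf_to_density(np.array([-0.1, 0.0, 0.1]))
```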

Table 2 Overview of works based on NeRFs according to the generator and the generated results

3.2.1 GANs and VAEs

The first attempt to generate radiance fields is GRAF [11]. The generator of GRAF is essentially a “conditional” NeRF, which takes random noise in addition to the positional encoding and renders a random patch of the full image; the patch is fed into the discriminator. NeRF-VAE [58] bases its generative model on a VAE and uses multiple scenes for training. pi-GAN [59] replaces the standard activation functions with sinusoidal ones as proposed by SIREN [97], while avoiding convolution-based networks. GIRAFFE [12] extends GRAF by disentangling geometry and appearance features and by treating the scene in a compositional manner.

Wang et al. [60] utilized template and deformation fields on the geometry to control shape generation. VolumeGAN [61] introduces a feature volume and uses volume rendering to map it into image space. GIRAFFE HD [62] further improves the resolution of the generated images by using a super-resolution module [22]. Xu et al. [63] proposed another extension of GRAF: to improve the fidelity of the generated geometry, they added a progressive sampling strategy. ShadeGAN [64] uses consistency under multiple lighting conditions as a further constraint on the generated shapes. StyleNeRF adopts the super-resolution scheme of StyleGAN [22] to improve 3D consistency, using a novel regularization loss and upsampler. StyleSDF [36] replaces the original density field of NeRF with an SDF and simultaneously renders a low-resolution image and a 2D feature map; the feature map is transformed into a high-resolution image by a 2D generator. Persistent Nature [93] represents the terrain with a grid and uses an upsampler to generate fine-grained geometry. Discoscene [94] generates large scenes from a layout prior consisting of labeled bounding boxes and generates radiance fields inside the boxes. GRAM [65] combines the primitive-based method and radiance fields, generating multiple manifolds and their radiance; volume rendering is modified to directly integrate the radiance of these manifolds. GVP [66] is based on the same idea as GRAM, predicting multiple primitives with radiance fields defined in them.

Apart from the works above that focus on generating “general” 3D shapes, generating human faces or bodies (also called “avatars”) is an active topic. Multi-NeuS [84] attempts to directly generate 3D heads represented by the SDF of NeuS. Tewari et al. [85] further disentangled face geometry and appearance by predicting a deformation field and an appearance network. Tang et al. [86] used an explicit parametric face model for better control of the generated faces. Volux-GAN [87] incorporates lighting into 3D face synthesis by using an environment map and decomposing the material, achieving relighting of the synthesized models. AnifaceGAN [88] generates an animatable 3D face, using different codes to generate template and deformation fields together with an imitation loss. EVA3D [82] introduces the SMPL human body prior, segments the human body into multiple bounding boxes, and generates radiance fields inside them. MetaHead [89] and GANHead [90] introduce additional priors such as semantic labels and FLAME representations to generate 3D human heads. GeneFace [91] and GeneFace++ [92] drive talking faces directly with audio by controlling facial landmarks, which in turn control the 3D faces.

3.2.2 Diffusion models

Recently, diffusion models have rapidly gained popularity since the proposal of the LDM [15]. However, it is not straightforward to apply DMs directly to radiance fields: since an RF is a function, it is difficult to add noise to it directly. One approach is to use voxel-based radiance fields [67], building the base NeRF model on voxel-grid representations such as DVGO [98]; voxel-based radiance fields are fast to render, and 3D U-Nets can be used to implement the diffusion process. Holodiffusion [68] also adopts a voxel-grid representation for diffusion, using a single 2D image as a prior and a diffusion model to generate 3D shapes. Another type of diffusion model for RFs performs diffusion on latent vectors and uses a conditional NeRF that maps a latent code to 3D shapes. 3D-CLFusion [69] combines a CLIP encoder and a diffusion model, enabling the model to use both images and text as input conditions. Neuralfield-LDM [70] makes the diffusion model “hierarchical”, taking 1D, 2D, and 3D latent features simultaneously; it also trains an autoencoder to obtain these features from NeRFs.

3.2.3 2D-to-3D

2D-to-3D models are a “special” category of models that take advantage of publicly available large pre-trained diffusion models. These methods inherit the capabilities of large DMs, such as multi-modal conditioning, and can generate almost any type of object. The pioneering work on 2D-to-3D models is Dreamfusion [14]. To distill 2D generation models into 3D, Dreamfusion trains a NeRF for every generated object by minimizing the SDS loss. The SDS loss drops the U-Net Jacobian from the gradient of the original diffusion loss, which proves effective in practice. However, several problems associated with the SDS loss have been observed: 1) The method needs to optimize a NeRF every time it generates a shape, limiting generation speed. 2) The underlying Stable Diffusion model has no pose prior, introducing ambiguity into the generated geometry; this can cause artifacts such as the “Janus problem”, low-detail geometry, or over-smooth geometry. 3) Appearance problems: to make optimization with the SDS loss converge, the guidance weight is set high, which can lead to overly saturated colors in the generated shapes, making them look “cartoonish” and unrealistic.

To address these problems, several improvements have been proposed. Latent-NeRF [71] uses a feature-space NeRF instead of an image-space one to better connect with Stable Diffusion [15], the state-of-the-art guidance model used in 2D-to-3D methods. Perp-Neg [99] attempts to solve the “Janus problem” by using negative prompts in the diffusion model to make it faithfully produce images from the desired views. SJC [72] proposes another loss term that is similar but not identical to SDS and provides a clearer derivation of the loss. ProlificDreamer [73] replaces the SDS loss with the variational score distillation (VSD) loss and shows that the VSD loss is more general and produces higher-quality shapes; this modified loss can partially solve the low-detail and over-smooth problems of Dreamfusion. Magic3D [74] extends Dreamfusion to a two-stage coarse-to-fine approach using a mesh prior, improving generation quality. Fantasia3D [75] also leverages a mesh prior and a two-stage pipeline, and further decomposes the material into PBR components. Two-stage methods can alleviate the unrealistic appearance problem mentioned earlier, as they can optimize the appearance individually, reducing the need for 3D guidance. They can also potentially mitigate the geometry problems, since the extracted mesh serves as a template and a strong prior, making it easier for the diffusion model in the refinement stage to guide optimization. Apart from purely text-guided generation, conditions based on a single image or a few images can also be applied. NeRDi [76] generates 3D shapes from a diffusion prior and a single image, narrowing the prior with visual cues and textual descriptions. Dream3D [77] uses both CLIP and diffusion model priors for generation. Dreambooth3D [78] uses a three-stage strategy that combines text-to-image and text-to-3D methods to gradually refine the generated NeRF. Zero-1-to-3 [79] tackles the pose ambiguity problem by controlling the camera pose of the views generated by the diffusion model. Make-It-3D [100] also uses a two-step strategy: it first lifts the single image to a radiance field using a depth map estimated by off-the-shelf methods, and then uses a diffusion model prior to refine the geometry.
3DFuse [80] improves the 3D consistency of generated shapes by feeding the diffusion model a generated depth map. Beyond single objects, scenes containing multiple shapes can also be generated by 2D-to-3D methods. Po and Wetzstein [95] consider text input and the bounding boxes of multiple objects at the same time: the bounding boxes are used as masks during rendering, every object is “merged” using the masks, and the whole image is used to compute the SDS loss. CompoNeRF [96] also uses a bounding-box-based scene composition, but applies the SDS loss to both the local (inside each bounding box) and global geometry; the local radiance fields are “projected” into a global MLP during joint training. Other works focus on the optimization process itself or on human avatars. DreamTime [81] samples diffusion timesteps with a non-increasing function when optimizing the NeRF with the SDS loss. DreamAvatar [83] takes SMPL [101] parameters and text as input and applies the SDS loss in both canonical space and observation space to generate a NeRF model.

3.2.4 Summary

The NeRF algorithm was the origin of the recent “explosion” of work on 3D shape generation. NeRF can fuse the geometry and appearance of objects into a neural network while preserving quality, and due to its simplicity, it can be combined with various generator architectures to obtain decent results. However, NeRF also has disadvantages. Even with acceleration methods such as Instant-NGP [102], it is still difficult to render a NeRF in real time as cheaply as a mesh, which affects the performance of various 2D-to-3D generation methods. The all-MLP architecture of NeRF also makes editing, filtering, and post-processing challenging, which in turn prevents fine-grained control during generation. Finally, the relatively high dimensionality (5D) of the NeRF input can cause artifacts such as floaters due to overfitting.

3.3 Triplane-based shape generation

As a representation, the triplane was introduced by EG3D [13], which is dedicated to the generation of high-quality human head geometry. EG3D shows that the 3D data of a radiance field can be effectively compressed into three 2D feature maps, which can be directly generated by StyleGAN [22]. During rendering, features are sampled from the three planes, concatenated, and fed into downstream networks. Exploring the potential of the triplane representation has recently become a popular topic. In Table 3, we briefly categorize triplane-based shape generation methods according to the generation architecture and the type of generated results.

Table 3 Overview of works based on triplanes according to the generator and the generated results

3.3.1 GANs, VAEs and 2D-to-3D

In the pioneering work on triplanes, EG3D [13] uses a StyleGAN [22] structure to generate the triplane, together with standard NeRF-like volume rendering techniques. Noguchi et al. [115] extend EG3D and its GAN+triplane approach to articulated humans. Avatargen [116] leverages the SMPL [101] body prior to generate controllable human avatars. EpiGRAF [103] removes the upsampler of EG3D and trains the GAN on patches to improve generation fidelity. IDE-3D [118] extends EG3D with separate geometry and texture codes and employs a sophisticated generator, encoder, and GAN inversion techniques. NeRFFaceEditing [119] further enhances IDE-3D, enabling fine-grained editing of the generated results by utilizing appearance codes and semantic masks. DATID-3D [120] aims to transfer EG3D to another domain, e.g., animation; it uses a pre-trained text-to-image model to generate a new dataset and uses the refined dataset to adapt the underlying EG3D model. PODIA-3D [121] extends DATID-3D by making the diffusion model pose-aware and adding a text-based debiasing module. GET3D [104] generates triplanes with a GAN, modifies the MLP head so that it also predicts texture values, and generates geometry via a tetrahedron-based proxy mesh; DMTet [18] is finally used to extract a triangular mesh from the SDF. SinGRAF [130] generates 3D scenes that follow the pattern of a specific scene, given only images of that scene. Next3D [122] extends EG3D to generate animatable faces, using two triplanes, one of which deforms the static geometry. PV3D [123] goes further and generates dynamic videos; it extends triplanes and separates appearance codes from motion codes, ensuring consistent motion. MAV3D [105] extends 2D-to-3D generation to dynamic scenes, using a “hexplane” representation that incorporates the time axis; the SDS loss is applied to both the static image and the dynamic video. TAPS3D [106] extends GET3D to allow text-to-3D generation via a caption generation module and CLIP, modeling geometry and texture with different triplanes. Skorokhodov et al. [107] considered arbitrary cameras and utilized depth priors, allowing training on more diverse and challenging datasets such as ImageNet [131]. LumiGAN [124] generates geometry, albedo, specular tint, and visibility at the same time using the triplane representation, and uses spherical harmonics (SH) lighting to render the head instead of the original NeRF shading. NeRFFaceLighting [125] uses separate shading, geometry, and albedo triplanes, with the shading triplane conditioned on SH lighting; the lighting and rendering are fed into separate discriminators, and a regularization method is used to enhance the generalizability of the algorithm. PanoHead [126] generates 3D human heads from 360° full-head images, using self-adaptive image alignment and a tri-grid volume to resolve the “mirrored face” artifact of EG3D. Head3D [127] utilizes a teacher-student distillation technique and a dual-discriminator structure to close the front-back gap in full-head generation present in EG3D-based methods. GINA-3D [108] decouples representation learning from generation and uses a VAE to map input images to a triplane-based latent feature using quantization, cross-attention, and neural rendering. Trevithick et al. [128] generated 3D heads from a single input image at real-time speed, eliminating the costly generator at inference time; they used encoders and ViT modules to generate the triplanes.
AG3D [117] separately generates canonical humans and poses (via deformation) and uses multiple discriminators on normal maps and rendered images, with separate discriminators for the whole body and the face. Zhu et al. [109] based their method on GET3D with the aim of applying the trained model to another domain using silhouette images.

3.3.2 Diffusion models

Recently, diffusion models have received considerable attention because of their ability to generate high-quality 2D data. Since triplanes compress 3D data into 2D feature maps, combining DMs and triplanes is natural. RenderDiffusion [110] uses the diffusion model as the backbone and a single image as a condition: for every denoising step, the image is encoded into a triplane and rendered back into a denoised image via volume rendering. Rodin [129] is another pioneering study; it proposes a roll-out diffusion network that performs 3D-aware diffusion and takes advantage of multi-modal conditions. 3DGen [111] extends Rodin by using a VAE to obtain a latent space and a diffusion model to generate latent features. NerfDiff [112] uses a camera-aligned triplane to resolve depth ambiguity; images rendered from the generated triplane are fed back into the diffusion model to improve generation quality. SSDNeRF [113] applies the diffusion model directly to triplane representations, jointly learning the diffusion model and a decoder that renders the triplane into a NeRF; this joint learning enables single-view reconstruction. Gu et al. [114] combined components of VAEs, GANs, and diffusion models, using a GAN to learn a latent triplane and training a diffusion model on it; the model supports both “conditioning” via the encoder and “guidance” via the diffusion model.

3.3.3 Summary

In contrast to the 3D nature of SDF and NeRF, the triplane compresses 3D data into three 2D planes and shows that these planes contain most of the information needed to reconstruct high-quality 3D shapes. This enables various methods to leverage 2D generators for 3D data, increasing generation speed and simplifying pipeline design. However, while generation quality is high for human faces and avatars, which benefit from strong priors, the quality of general objects generated with triplanes appears somewhat lower. There is also a lack of work dedicated to generating large-scale, multi-object scenes based on triplanes. It may take more time for researchers to realize the full potential of the triplane as an independent representation.

4 Discussion

After reviewing recent work on implicit representation-based 3D shape generation, we will now discuss some of the open problems and future directions in this area.

4.1 3D shape generation with higher quality

Implicit representations of 3D shapes rely on a learned function to capture all the details of the geometry. However, this pipeline still struggles to generate high-fidelity, fine-grained geometry. This is caused both by limitations of the representation itself (the low-frequency bias of MLP networks and the inherent ambiguity of implicit representations) and by the design of the architecture (the difficulty of designing a GAN discriminator, the “blurry” generations of VAEs, and the high computational cost of diffusion). Beyond geometric quality, the appearance of the generated shapes also needs improvement. Radiance fields and 2D-to-3D generation architectures seem to be good choices for generating appearance along with geometry, but methods to thoroughly control the appearance of the generated content are still lacking.

4.2 Faster 3D shape generation

Radiance fields have various advantages as a representation. However, their rendering speed is not very satisfactory: it can take minutes to render a single view of a vanilla NeRF. In terms of generation speed, early work that directly generates RFs or SDFs using GANs or VAEs runs relatively quickly, but the quality is undesirable. Recent work based on the 2D-to-3D generation pipeline usually needs to optimize a radiance field for each generated shape, taking approximately 30 minutes per shape. Triplanes may provide a good balance between speed and quality, but rendering triplane-based shapes also requires costly ray marching. Generating and rendering implicit 3D shapes at real-time speed is still an open problem.

4.3 3D shape generation on a larger scale

The difficulty of generating large-scale implicit 3D shapes is twofold: 1) Implicit representations are not an ideal choice for large scenes; they often require some form of bounded input range to make the function easier to learn, making it difficult to balance the input range against the scene scale. 2) The generation pipeline for large scenes is difficult to design, since it requires a holistic understanding of the scene, preventing the usual local patch-based approach. Generating large scenes is nonetheless valuable, since downstream applications such as robotics and autonomous driving require such scenes. Splitting large scenes into smaller ones [96] may be a solution, and we believe there is research to be done in this area.

4.4 Combination with other representations

This survey focuses on implicit representations, but other representations such as meshes, point clouds, and structure-based or “procedural” [132] representations also have significant advantages. Recently, several studies such as Neumesh [133] and DE-NeRF [134] have combined implicit and explicit representations to improve editability. These works require a mesh as input, which can be obtained by off-the-shelf reconstruction methods [21] or from priors such as the SMPL mesh [101]. This combination provides implicit representations with topological priors and offers both higher geometric quality and better local editing. In the field of 3D shape generation, works like GET3D [104] also utilize the traditional “mesh and texture” pipeline and achieve good performance by converting implicit representations to explicit meshes using differentiable middleware such as deep marching tetrahedra [18]. In general, combining multiple representations for 3D content creation is a topic well worth exploring.

5 Conclusion

This survey has reviewed recent advances in 3D shape generation methods based on implicit representations. We began with an introduction to the most commonly used implicit representations and generation architectures, and then reviewed recent work on implicit representation-based 3D shape generation in detail, categorizing these studies according to the type of 3D representation they use: signed distance fields, radiance fields, and triplanes. We also included a brief timeline of the development of 3D shape generation based on implicit representations and highlighted key works in the literature. Finally, we discussed aspects of current work that need to be improved in future research. We hope that this survey provides useful insights for other researchers and inspires future work in this area.