Multi-view Consistent Generative Adversarial Networks for Compositional 3D-Aware Image Synthesis

This paper studies compositional 3D-aware image synthesis for both single-object and multi-object scenes. We observe that two challenges remain in this field: existing approaches (1) lack geometry constraints and thus compromise the multi-view consistency of the single object, and (2) can not scale to multi-object scenes with complex backgrounds. To address these challenges coherently, we propose multi-view consistent generative adversarial networks (MVCGAN) for compositional 3D-aware image synthesis. First, we build the geometry constraints on the single object by leveraging the underlying 3D information. Specifically, we enforce the photometric consistency between pairs of views, encouraging the model to learn the inherent 3D shape. Second, we adapt MVCGAN to multi-object scenarios. In particular, we formulate the multi-object scene generation as a “decompose and compose” process. During training, we adopt the top-down strategy to decompose training images into objects and backgrounds. When rendering, we deploy a reverse bottom-up manner by composing the generated objects and background into the holistic scene. Extensive experiments on both single-object and multi-object datasets show that the proposed method achieves competitive performance for 3D-aware image synthesis.

(1) Existing approaches (Chan et al., 2021; Niemeyer & Geiger, 2021) do not guarantee geometry constraints between views and hence often fail to generate multi-view consistent images. (2) As pointed out by Schwarz et al. (2020), current methods do not work well on scenes that contain multiple objects with complex shapes and diverse backgrounds.
Fig. 1 [...] (Karras et al., 2018) dataset. We render multi-view images at resolution $512^2$ from different viewpoints
In this paper, we address the first problem by proposing MVCGAN, a multi-view consistent generative model with geometry constraints (see Fig. 1).
Here we present typical failure cases of an existing approach (Niemeyer & Geiger, 2021) in Fig. 2. We identify the cause of the inconsistency between views: previous methods optimize a single view of the generated image independently while ignoring the geometry constraints between views (see Sect. 3.2.1). To tackle this problem, the proposed MVCGAN takes inspiration from classical multi-view geometry methods (Zhou et al., 2017; Godard et al., 2019) to build geometry constraints across views. Specifically, we perform multi-view joint optimization by enforcing the photometric consistency between pairs of views with a re-projection loss and by integrating a stereo mixup mechanism into the training process. As a result, the generator not only learns the manifold of 2D images but is also encouraged to learn a geometrically correct underlying 3D shape. Besides, we notice that NeRF-based generative approaches (Chan et al., 2021; Niemeyer & Geiger, 2021) typically struggle to render high-resolution images with fine details due to the high computational complexity of the NeRF model (Mildenhall et al., 2020). Therefore, we adopt a hybrid MLP-CNN architecture, which contains one MLP-based NeRF model and one CNN-based decoder. Specifically, the MLP-based NeRF model (Mildenhall et al., 2020) renders the geometry of the 3D shape, and the subsequent CNN-based decoder produces fine details for the 2D appearance. This structure can generate photorealistic high-resolution images while alleviating the computational cost.
We further extend MVCGAN to multi-object and background-attached scenarios with a compositional framework, MVCGAN+. Specifically, MVCGAN+ employs two MVCGAN branches to model the foreground objects and backgrounds separately. Besides, we propose a "decompose and compose" scheme to perform complex scene generation in a top-down and bottom-up manner. During training, we explicitly incorporate object masks to decompose the training images into objects and backgrounds. The disentanglement of objects and backgrounds allows us to impose geometry constraints on the foreground object and the background separately. When rendering the whole scene, we compose the objects and backgrounds via object masks and occlusion relations. Our main contributions are summarized as follows:
1. We identify the cause of the multi-view inconsistency in 3D-aware image synthesis, and propose to incorporate geometry constraints into the generative radiance field for single-object scene generation.
2. To tackle complex multi-object scenes, we further scale MVCGAN to a compositional framework with top-down and bottom-up manners. To our knowledge, this is among the early attempts to incorporate instance masks into generative radiance fields to tackle complex multi-object scenarios.
3. We demonstrate the effectiveness and scalability of the proposed approach by evaluating on both single-object and multi-object datasets. Extensive experiments substantiate that our method achieves competitive performance for 3D-aware image synthesis.
This paper is an extension of our previous conference version (Zhang et al., 2022). Compared to the preliminary version, this work includes the following new contents. (1) Owing to the inadequate exploration of complex multi-object scenes in current works, we scale MVCGAN (Zhang et al., 2022) to a compositional framework, MVCGAN+, for multi-object 3D-aware image generation. In particular, we model the foreground objects and backgrounds with two separate branches. (2) By incorporating the easily-obtained 2D annotations, i.e., instance masks and bounding boxes, we formulate the multi-object image generation as a "decompose and compose" process. To our knowledge, we are among the first attempts to incorporate instance masks into generative radiance fields to tackle the multi-object generation problems.
(3) To further validate the competence of our method, we add more experiments and discussions for ablation studies and visualization results.
Fig. 2 Taking GIRAFFE (Niemeyer & Geiger, 2021) as an example, the generated images in the first row have obviously inconsistent appearance artifacts between views, such as the direction of the hair (blue box) and the open mouth (green box). Besides, we notice that GIRAFFE (Niemeyer & Geiger, 2021) suffers from collapsed results under large pose variations (see the leftmost and rightmost pictures in the first row), which indicates that the model does not learn an appropriate 3D shape. In contrast, our method generates high-quality images with multi-view consistency (see the second row) (Color figure online)

Related Work

3D-Aware Image Synthesis
Generating photorealistic and editable image content is a long-standing problem in computer vision and graphics. In the past years, generative adversarial networks (GANs) (Goodfellow et al., 2020) have demonstrated impressive results in synthesizing high-quality, high-resolution images from unstructured image collections (Zhu et al., 2017; Brock et al., 2018; Choi et al., 2018; Karras et al., 2018; Huang & Belongie, 2017; Karras et al., 2019; Zheng et al., 2019; Karras et al., 2020; Choi et al., 2020). Despite this success, most of these methods only learn the manifold of 2D images while ignoring the 3D representation of the scene. Recently, several works have investigated how to incorporate 3D representations into generative models (Alhaija et al., 2018; Nguyen-Phuoc et al., 2019; Zhu et al., 2018; Liao et al., 2020; Nguyen-Phuoc et al., 2020; Henderson et al., 2020; DeVries et al., 2021). Nguyen-Phuoc et al. (2019) combine a strong inductive bias about the 3D world with deep generative models to learn disentangled representations of 3D objects, which provides control over the pose of generated objects through rigid-body transformations of the learned 3D features. Schwarz et al. (2020) propose GRAF, generative radiance fields for 3D-aware image synthesis from unposed 2D images. pi-GAN (Chan et al., 2021) adopts a SIREN-based neural implicit representation with periodic activation functions as the backbone of the generator. GIRAFFE (Niemeyer & Geiger, 2021) represents scenes as compositional generative neural feature fields. ShadeGAN models the illumination to regularize the training process. Combining the occupancy representation with radiance fields, Xu et al. (2021) introduce Generative Occupancy Fields (GOF) to shrink the sample region of the volume rendering process.
StyleNeRF (Gu et al., 2022) integrates NeRF (Mildenhall et al., 2020) into a StyleGAN-like generator (Karras et al., 2019, 2020) to close the gap between 2D and 3D GANs. Zhou et al. (2021) extend CIPS (Anokhin et al., 2021) to CIPS-3D, a 3D-aware generator that is composed of a NeRF and an implicit neural representation network. StyleSDF (Or-El et al., 2022) achieves high-resolution image generation and 3D surface modeling by integrating an SDF-based 3D representation into the 2D style-based generative model (Karras et al., 2019, 2020). Recently, Chan et al. (2022) introduce a novel tri-plane representation with 3D inductive bias, resulting in a more efficient and expressive 3D GAN framework, EG3D. VolumeGAN (Xu et al., 2022) learns a structural and a textural representation with a 3D feature volume and a neural renderer, respectively. Deng et al. (2022b) reduce the number of sampling points by learning generative 2D manifolds (GRAM), while GRAM-HD (Xiang et al., 2022) achieves better results by performing super-resolution in the 3D space. VoxGRAF (Schwarz et al., 2022) explores sparse voxel grid representations to accelerate training. Skorokhodov et al. (2022) redesign the patch-based discriminator to improve the optimization scheme of 3D generative adversarial networks. However, these methods typically optimize a single view of the generated scene independently and ignore the underlying geometry constraints across views.

Methodology
Our goal is to generate photorealistic high-resolution images with explicit control over the camera pose while maintaining multi-view consistency. We now present the main components of the proposed method. First, we briefly review the background of NeRF-based generative adversarial networks (Niemeyer & Geiger, 2021; Chan et al., 2021) and identify the limitations of previous methods (see Sect. 3.1). Second, we analyze the cause of the multi-view inconsistency problem and present Multi-View Consistent Generative Adversarial Networks (MVCGAN) for single-object generation (see Fig. 5 for an overview). Finally, based on MVCGAN, we introduce a compositional framework (MVCGAN+) for multi-object image generation in Sect. 3.3.

Preliminaries
Fig. 3 Visualization of shape-radiance ambiguity. For illustration, we assume $p$ (the red dot) is the location of the correct geometry, while $p_1$ (the violet dot) and $p_2$ (the blue dot) are incorrect geometries. In the absence of geometry constraints, the model can fit the incorrect geometry $p_1$ in view 1 and $p_2$ in view 2 independently to simulate the effect of the correct geometry $p$ (Color figure online)
Neural Radiance Fields. A neural radiance field (NeRF) synthesizes novel views of a scene by optimizing a fully-connected network using a set of input views. The MLP network maps a continuous 5D coordinate (3D location $\mathbf{x}$ and 2D viewing direction $\mathbf{d}$) to an emitted color $\mathbf{c}$ and volume density $\sigma$ (Mildenhall et al., 2020):
$$F_\theta : (\gamma(\mathbf{x}), \gamma(\mathbf{d})) \rightarrow (\mathbf{c}, \sigma), \tag{1}$$
where $\gamma$ indicates the positional encoding mapping function.
To render the neural radiance field from a viewpoint, Mildenhall et al. (2020) use classic volume rendering to accumulate the output colors $\mathbf{c}$ and densities $\sigma$ into an image.
Generative Radiance Fields. Generative neural radiance fields aim to learn a model for synthesizing novel scenes by training on unposed 2D images. Schwarz et al. (2020) adopt an adversarial framework to train a generative model for radiance fields (GRAF). The generative radiance field is conditioned on a shape code $z_s$ and an appearance code $z_a$:
$$G_\theta : (\gamma(\mathbf{x}), \gamma(\mathbf{d}), z_s, z_a) \rightarrow (\mathbf{c}, \sigma). \tag{2}$$
Following GRAF, Niemeyer and Geiger (2021) introduce a compositional generative neural feature field (GIRAFFE). Inspired by StyleGAN (Karras et al., 2019), Chan et al. (2021) instead propose periodic implicit generative adversarial networks (pi-GAN) with feature-wise linear modulation (FiLM) conditioning.
Limitations. We notice two limitations of existing approaches (Niemeyer & Geiger, 2021; Chan et al., 2021). First, they do not guarantee geometry constraints between different views. Consequently, they usually suffer from collapsed results under large pose variations or show obvious inconsistent artifacts across views. Second, these approaches mostly cannot handle scenes that contain multiple objects and complex backgrounds.
Fig. 4 Illustration of the warping process. For each pixel $v_{pri}$ in the primary image $I_{pri}$, we first calculate the location of $v_{aux}$ (the corresponding pixel of $v_{pri}$ in the auxiliary image $I_{aux}$) based on the depth value $D(v_{pri})$ and the camera transformation matrix $[R, t]$. Then we can reconstruct the pixel $v_{pri}$ of the warped image $I_{warp}$ from the primary view using the value of pixel $v_{aux}$. We observe that the warped image has a wrong appearance, which verifies the incorrect geometry learned by the model

Image-Level Multi-view Joint Optimization
Shape-radiance Ambiguity. In this part, we analyze the cause of the multi-view inconsistency problem in NeRF-based generative models. We observe that optimizing the radiance fields from a set of 2D training images can encounter critical degenerate solutions in the absence of geometry constraints. This phenomenon is referred to as shape-radiance ambiguity (Zhang et al., 2020), in which the model can fit the training images with an inaccurate 3D shape by a suitable choice of the radiance field at each surface point (see Fig. 3). To better illustrate the shape-radiance ambiguity, we warp the rendered images from view 1 to view 2 based on the underlying depth and the camera transformation matrix $[R, t]$ (see the details of the warping process in Fig. 4 and Eq. 4). We find that the warped image shows a wrong appearance, which verifies the assumption of degenerate solutions to the learned 3D shape. To avoid the shape-radiance ambiguity, NeRF (Mildenhall et al., 2020) requires a large number of posed training images from different input views of the scene. However, generative radiance fields have neither annotated camera poses nor sufficient multi-view images in the training dataset. Consequently, the generative model can synthesize reasonable images in some views but produce poor renderings in other views (see Fig. 2).
Warping Process. To alleviate the shape-radiance ambiguity (Zhang et al., 2020), we propose to establish multi-view geometry constraints (Chen & Williams, 1993; Debevec et al., 1996; Andrew, 2001; Seitz & Dyer, 1996; Zhou et al., 2017; Godard et al., 2019) via the warping process between views. First, following pi-GAN (Chan et al., 2021), we adopt a style-based generator that contains a synthesis network $G_s$ (a SIREN-based (Sitzmann et al., 2020; Chan et al., 2021) generative radiance field) and a mapping network $G_m$ (a simple MLP network with ReLU) (see Fig. 5).
Given a latent code $z \in \mathbb{R}^{256}$ in the input latent space $\mathcal{Z}$, the mapping network $G_m : \mathcal{Z} \rightarrow \mathcal{W}$ produces the intermediate latent $w \in \mathbb{R}^{256}$, which controls the synthesis network $G_s$ at each layer. Second, instead of only optimizing a single view independently, we aim to optimize multiple views jointly to maintain the 3D consistency across views. As shown in the left of Fig. 5, we randomly sample two camera poses, i.e., the primary pose $\xi_{pri}$ and the auxiliary pose $\xi_{aux}$, from the pose distribution $p_\xi$. Taking $\xi_{pri}$ and $\xi_{aux}$ as input, the generative model $G_s$ synthesizes two views of the generated scene separately: the primary image $I_{pri}$ and the auxiliary image $I_{aux}$. Then we can build geometry constraints between $\xi_{pri}$ and $\xi_{aux}$ via image warping, which reconstructs the primary view by sampling pixels from the auxiliary image $I_{aux}$. Specifically, for each pixel $v_{pri}$ in the primary image $I_{pri}$, we first find the corresponding pixel $v_{aux}$ in the auxiliary image $I_{aux}$ through the stereo correspondence, and then reconstruct the pixel $v_{pri}$ of the warped image $I_{warp}$ from the primary view using the value of $v_{aux}$ (see Fig. 4). Next, we present the detailed calculation procedure of the warping process. The stereo correspondence is calculated based on the depth map $D$ of the primary image and the camera transformation matrix from $\xi_{pri}$ to $\xi_{aux}$. The depth can be rendered in a similar way as the color image (Mildenhall et al., 2020; Deng et al., 2022a). Given the pixel $v_{pri}$ from the primary view, the depth value $D(v_{pri})$ is formulated as:
$$D(v_{pri}) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) d_i, \quad T_i = \exp\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big), \tag{3}$$
where $N$ is the number of samples along the camera ray, $\delta_i = d_{i+1} - d_i$ is the distance between adjacent sample points, and $\sigma_i$ is the volume density of sample $i$ (refer to Mildenhall et al. (2020) and Deng et al. (2022a) for more details). With the depth value $D(v_{pri})$, we can obtain the homogeneous coordinates $h_{pri}$ of pixel $v_{pri}$ in the primary camera coordinate system through perspective projection.
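As a rough illustration of the depth rendering step described above, the following sketch accumulates an expected depth along a single ray using the standard NeRF volume-rendering weights. The function name `render_depth` and the `last_delta` argument (the length assigned to the final interval, which has no successor sample) are assumptions made for this example and are not taken from the paper.

```python
import math


def render_depth(sigmas, dists, last_delta=1e10):
    """Expected depth along one ray (hypothetical helper, not the authors' code).

    sigmas: volume densities sigma_i at the N samples
    dists:  increasing sample distances d_i along the ray (N values)
    last_delta: assumed length of the final interval
    """
    n = len(sigmas)
    transmittance = 1.0  # T_i: probability the ray reaches sample i unoccluded
    depth = 0.0
    weights = []
    for i in range(n):
        # delta_i = d_{i+1} - d_i; the last sample uses the assumed last_delta
        delta = dists[i + 1] - dists[i] if i + 1 < n else last_delta
        alpha = 1.0 - math.exp(-sigmas[i] * delta)
        w = transmittance * alpha
        weights.append(w)
        depth += w * dists[i]
        transmittance *= 1.0 - alpha
    return depth, weights


# A ray that hits an (almost) opaque surface at distance 2.0
depth, weights = render_depth([0.0, 1e6, 0.0], [1.0, 2.0, 3.0])
```

With nearly all of the weight on the second sample, the rendered depth is close to 2.0, which is the behavior the warping step relies on.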
Then the projected coordinates $h_{aux}$ in the auxiliary view are calculated from $h_{pri}$ (Eq. 4) using the camera intrinsics $K$, which are known parameters, and the camera transformation matrix $[R, t]$, which can be calculated from the primary pose $\xi_{pri}$ and the auxiliary pose $\xi_{aux}$. Finally, we can reconstruct the pixel $v_{pri}$ in the warped image $I_{warp}$ from the primary view using the value of pixel $v_{aux}$ (located at $h_{aux}$ in $I_{aux}$).
Fig. 5 The generator of MVCGAN. During training, the generative radiance field network $G_s$ takes the primary pose $\xi_{pri}$ and the auxiliary pose $\xi_{aux}$ as input. The mapping network $G_m$ maps the input latent $z$ to the intermediate latent $w$, which conditions both the generative radiance field network $G_s$ and the progressive 2D decoder $G_d$. In Stage I, we directly render the primary image $I_{pri}$ and the auxiliary image $I_{aux}$ with the color and density output from $G_s$. Then we perform image-level multi-view joint optimization and output a low-resolution RGB image ($64^2$). In Stage II, we instead use volume rendering to accumulate 2D feature maps at low resolution ($64^2$), and then perform multi-view optimization at the feature level. The progressive 2D decoder $G_d$ upsamples the 2D feature map $F_{mix}$ to a high-resolution RGB image ($128^2$, $256^2$, $512^2$) for fine 2D details. During inference, only the primary pose is required, without the auxiliary pose (the dotted lines do not participate in inference)
Image-level Joint Optimization. After obtaining the warped image $I_{warp}$, we perform image-level multi-view joint optimization by enforcing the photometric consistency and employing a stereo mixup module (see Fig. 6). To satisfy the geometry constraints between views, we enforce the photometric consistency across views by minimizing the re-projection loss between the primary image $I_{pri}$ and the warped image $I_{warp}$.
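The stereo correspondence used in the warping process can be sketched for a single pixel as follows. This is only an illustration under stated assumptions: a pinhole camera with one focal length and principal point `(cx, cy)`, a camera looking along the positive z axis, and a transformation `[R, t]` that maps primary-camera coordinates to auxiliary-camera coordinates. The function name `warp_pixel` and its parameters are hypothetical.

```python
def warp_pixel(u, v, depth, focal, cx, cy, R, t):
    """Find where primary pixel (u, v) lands in the auxiliary image.

    Assumed conventions (not specified in the text): pinhole intrinsics
    K = [[focal, 0, cx], [0, focal, cy], [0, 0, 1]], and [R, t] maps
    primary-camera coordinates to auxiliary-camera coordinates.
    """
    # Back-project the pixel to a 3D point using its rendered depth
    x = (u - cx) / focal * depth
    y = (v - cy) / focal * depth
    p_pri = [x, y, depth]
    # Rigid transformation into the auxiliary camera frame
    p_aux = [sum(R[r][c] * p_pri[c] for c in range(3)) + t[r] for r in range(3)]
    # Perspective projection with the same intrinsics
    u_aux = focal * p_aux[0] / p_aux[2] + cx
    v_aux = focal * p_aux[1] / p_aux[2] + cy
    return u_aux, v_aux


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A pure sideways translation shifts the image-center pixel horizontally
u_aux, v_aux = warp_pixel(32.0, 32.0, 2.0, 100.0, 32.0, 32.0, identity, [0.1, 0.0, 0.0])
```

In a full implementation the color at `(u_aux, v_aux)` would be sampled from the auxiliary image, typically with bilinear interpolation, to fill the corresponding pixel of the warped image.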
Following the common practice in image reconstruction (Wang et al., 2004; Zhao et al., 2016; Pillai et al., 2019; Zhou et al., 2017; Godard et al., 2019; Lyu et al., 2021), we formulate the image-level re-projection loss $L_{ir}$ as a weighted combination of L1 (Zhao et al., 2016) and SSIM (Wang et al., 2004) terms (Eq. 5), where SSIM is a perceptual metric of image structural similarity and the weight $\mu = 0.85$ is set empirically. In addition to being similar to the primary image, the warped image should also look like a real image. A straightforward method is to introduce two discriminators: one compares the warped image $I_{warp}$ with an arbitrary real image sampled from the training dataset, and the other compares the primary image $I_{pri}$. However, introducing extra modules increases the computational complexity. Inspired by the mixup strategy, we instead propose a stereo mixup module to optimize both $I_{pri}$ and $I_{warp}$ by constructing a virtual mixed image:
$$I_{mix} = \eta I_{pri} + (1 - \eta) I_{warp}, \tag{6}$$

where $\eta$ is a dynamic number randomly sampled from the range $[0, 1]$ in every training iteration, and $I_{mix}$ is the input of the discriminator. It is worth noting that the auxiliary pose is introduced to construct the geometry constraints and is thus only required in the training process. In the inference stage, the generative model only takes the primary pose $\xi_{pri}$ and the latent code $z$ as input to generate the primary image directly.
Fig. 6 Image-level multi-view joint optimization. We enforce the photometric consistency between the primary image $I_{pri}$ and the warped image $I_{warp}$ by minimizing the image-level re-projection loss $L_{ir}$. Besides, we integrate a stereo mixup module to encourage the warped image to be similar to a real image. The dotted line does not participate in the inference stage
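The two ingredients of image-level joint optimization, the re-projection loss and the stereo mixup, can be sketched on flattened grayscale patches as below. Several choices here are assumptions rather than the paper's definition: SSIM is computed over a single window covering the whole patch instead of local windows, the weighting `mu * (1 - SSIM) / 2 + (1 - mu) * L1` follows the convention of Godard et al. (2019), and all function names are hypothetical. Note also that `random.random()` samples from [0, 1) rather than the closed interval.

```python
import random


def l1_distance(a, b):
    """Mean absolute difference between two flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def global_ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM over one window covering the whole patch (a simplification)."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den


def reprojection_loss(primary, warped, mu=0.85):
    """Weighted SSIM + L1 photometric loss (assumed weighting convention)."""
    ssim_term = (1.0 - global_ssim(primary, warped)) / 2.0
    return mu * ssim_term + (1.0 - mu) * l1_distance(primary, warped)


def stereo_mixup(primary, warped, rng=random):
    """Blend primary and warped images with a freshly sampled eta."""
    eta = rng.random()
    mixed = [eta * p + (1.0 - eta) * w for p, w in zip(primary, warped)]
    return mixed, eta


primary = [0.2, 0.4, 0.6, 0.8]
warped = [0.8, 0.6, 0.4, 0.2]
loss = reprojection_loss(primary, warped)
mixed, eta = stereo_mixup(primary, warped, rng=random.Random(0))
```

Feeding only the mixed image to a single discriminator is what lets both the primary and the warped image receive adversarial feedback without a second discriminator.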

Feature-Level Multi-view Joint Optimization
In practice, we also encounter one challenge: NeRF-based generative models (Niemeyer & Geiger, 2021; Chan et al., 2021) typically struggle to render high-resolution images with fine details due to the high computational cost of the NeRF model (Mildenhall et al., 2020). To render images with both fine 2D details and correct 3D shape, we design a two-stage training strategy and extend multi-view optimization to the feature level. We begin training at a low resolution ($64^2$) in Stage I, and then increase to high resolutions ($128^2$, $256^2$, $512^2$) in Stage II (see Fig. 5).
In Stage I, we directly render the primary and auxiliary images with the color and density output from the generative radiance field network $G_s$. With the guidance of geometry constraints, we perform image-level multi-view joint optimization to enhance the geometric reasoning ability of the model. In Stage II, to alleviate the computational cost of rendering high-resolution images, we instead train the model via feature-level multi-view optimization for better visual quality. First, we adopt a hybrid MLP-CNN architecture to disentangle the geometry of the 3D shape from the fine details of the 2D appearance. Then we generalize volume rendering (Niemeyer & Geiger, 2021) to the feature level by rendering the 2D primary feature map $F_{pri}$ at low resolution ($64^2$):
$$F_{pri}(v_{pri}) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) f_i, \tag{7}$$
where $f_i \in \mathbb{R}^{256}$ is the feature before the final layer of $G_s$, and other symbols are defined in Eq. 3. The auxiliary feature map $F_{aux}$ is rendered in the same way as $F_{pri}$, and the warped feature map $F_{warp}$ can be obtained through the warping process. Second, we perform multi-view feature-level joint optimization on the low-resolution feature maps ($64^2$). To enforce the geometry consistency in the feature space, we take the implicit diversified Markov Random Fields (MRF) loss (Wang et al., 2018) as the feature-level re-projection loss $L_{fr}$ (Eq. 8), which can encourage the model to capture high-frequency geometry details (Feng et al., 2021). Then the stereo mixup mechanism is also applied to the 2D feature maps: $F_{mix} = \eta F_{pri} + (1 - \eta) F_{warp}$. Third, we increase the resolution with a style-based 2D decoder $G_d$ (Karras et al., 2019), which takes $F_{mix}$ as input and upsamples it to a high-resolution RGB image (see Fig. 7). The 2D decoder $G_d$ is conditioned by the mapping network $G_m$ through adaptive instance normalization (AdaIN) (Huang & Belongie, 2017; Dumoulin et al., 2020; Karras et al., 2019).
As training progresses, we adopt the progressive growing strategy to grow the generator for higher resolution (Karras et al., 2018). When new layers are added to $G_d$, we use skip connections to fade the inserted layers in smoothly to stabilize and speed up the training process (Karras et al., 2018, 2020).
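The fade-in of newly added layers can be illustrated with the blending rule commonly used in progressive growing: the output of the previous resolution is upsampled and linearly mixed with the output of the new layer while a coefficient grows from 0 to 1. This is a generic sketch of that idea, not the decoder used in the paper; nearest-neighbor upsampling, the single-channel nested-list images, and the names `upsample_nearest` and `fade_in` are assumptions.

```python
def upsample_nearest(img):
    """Double the resolution of a 2D image (list of rows) by pixel repetition."""
    out = []
    for row in img:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fade_in(low_res, high_res, alpha):
    """Blend the upsampled old output with the new layer's output.

    alpha = 0 keeps the old (upsampled) output; alpha = 1 uses only the new layer.
    """
    up = upsample_nearest(low_res)
    return [
        [(1.0 - alpha) * u + alpha * h for u, h in zip(up_row, high_row)]
        for up_row, high_row in zip(up, high_res)
    ]


low = [[1.0]]
high = [[0.0, 0.0], [0.0, 0.0]]
blended = fade_in(low, high, alpha=0.25)
```

Ramping `alpha` gradually over training iterations avoids the sudden change in the generator output that would otherwise destabilize the discriminator.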

MVCGAN+: Towards Multi-Object Generation
While remarkable results have been achieved on 3D-aware image generation, existing methods (Chan et al., 2021; Deng et al., 2022b; Gu et al., 2022; Xu et al., 2021; Pan et al., 2021; Chan et al., 2022) mostly focus on scenes with a single object in the center, and do not work well on multi-object scenes. At present, only GIRAFFE (Niemeyer & Geiger, 2021) considers the compositional properties of scenes and allows for multi-object image generation. However, GIRAFFE learns the compositional generative feature fields in an unsupervised manner, which makes it infeasible to decompose the scene into individual objects precisely. Because of this lack of appropriate supervision, GIRAFFE has only been verified on simple synthetic data, i.e., CLEVR (Johnson et al., 2017), while more realistic scenes with complex geometry and diverse textures remain unexplored.
Fig. 8 An overview of MVCGAN+. I. Decomposition Phase. We adopt a "top-down" strategy to train the object branch and the background branch. Specifically, we decompose the real images into foreground objects and backgrounds via masks and bounding boxes. Then we impose multi-view constraints to optimize the object generator $G_{obj}$ and the background generator $G_{bg}$ individually. Two discriminators, i.e., $D_{obj}$ and $D_{bg}$, are employed to perform adversarial training on generated images and real images. II. Composition Phase. We deploy a reverse "bottom-up" manner for rendering. We first generate foreground object images and background images with the object branch and the background branch, respectively. Then the whole image can be composed with object masks and occlusion relations
To extend to scenarios with multiple objects and backgrounds, we further propose MVCGAN+, a two-branch framework with extra supervision (see Fig. 8 for an overview). We formulate the multi-object scene generation as a "decompose and compose" process. During training, MVCGAN+ learns the whole scene in a top-down decomposition manner. Specifically, we incorporate easily accessible 2D annotations, i.e., object bounding boxes and instance masks, into training to disentangle objects and backgrounds. MVCGAN+ contains one object branch with $G_{obj}$ and $D_{obj}$, and one background branch with $G_{bg}$ and $D_{bg}$. For the object branch, we randomly select a single object from the whole scene and crop the corresponding patch with the masked backgrounds as the real object image (see Fig. 8). We encourage the object generator $G_{obj}$ to model the foreground object while leaving the background region as empty space. One remaining problem is that the content of unbounded and occluded regions, e.g., masked backgrounds, can be located at any distance along the ray. Due to the inherent ambiguity of 2D-to-3D correspondence, the object generator can generate arbitrary geometry outside the target object regions.
Consequently, semi-transparent materials may float in space and cause cloudy and foggy artifacts when viewed from another angle. Therefore, we use the sum of the color weights along the ray to blend the accumulated color with a white background, which suppresses the low-density areas (Eq. 9). Here $c(r)$ is the accumulated color of ray $r$ by volume rendering, $c_{white} = 1$ is the color of the white background (white equals 1 in the normalized color space), the weight term is the sum of the weights of the sampled colors along the ray $r$ (see Eq. 3 and Eq. 5 of the original NeRF paper (Mildenhall et al., 2020) for more details), and other symbols are defined in Eq. 3.
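A minimal sketch of this white-background blending for one ray is given below. It assumes the form commonly used with NeRF-style renderers, namely adding `(1 - sum of weights) * c_white` to the accumulated color; the exact expression in the paper may differ, and the function name `composite_on_white` is hypothetical.

```python
def composite_on_white(weights, colors, white=1.0):
    """Blend the accumulated ray color with a white background.

    weights: volume-rendering weights of the samples along the ray
    colors:  RGB colors of the same samples, each in [0, 1]
    Assumed form: c(r) + (1 - sum_i w_i) * c_white.
    """
    accumulated = sum(weights)
    rgb = [sum(w * c[k] for w, c in zip(weights, colors)) for k in range(3)]
    return [value + (1.0 - accumulated) * white for value in rgb]


# A ray that passes through empty space renders as pure white
empty = composite_on_white([0.0, 0.0], [[0.3, 0.3, 0.3], [0.5, 0.5, 0.5]])
# A ray fully absorbed by one sample keeps that sample's color
solid = composite_on_white([0.0, 1.0], [[0.0, 0.0, 0.0], [0.2, 0.4, 0.6]])
```

Because real object crops have white masked backgrounds, rays with low accumulated weight are pushed toward white, which discourages faint floating density.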
For the background branch, we follow NeRF++ (Zhang et al., 2020) and use an additional network $G_{bg}$ to model the complex backgrounds. As shown in Fig. 8, we remove all the foreground objects and fill the holes with an image inpainting method (Telea, 2004). Since the layout and geometry of the background environment are relatively simple, we can easily inpaint the occluded areas by searching for patches with similar textures in the surrounding regions. In this way, MVCGAN+ models the objects and backgrounds individually by leveraging the information of object bounding boxes and instance masks. The disentanglement of the objects and backgrounds allows us to impose the multi-view geometry constraints on the object branch and the background branch separately.
Table 1 Quantitative comparisons; best results are given in bold. We calculate FID between 20,000 generated and real images. "OOM" represents the out-of-memory error, and "Fail" denotes that the model fails to converge
In the composition phase, to compose the generated objects and backgrounds into a coherent scene, we first perform object arrangement and then reason about the geometric relations between the foreground objects and the backgrounds. For the object placement, we follow GIRAFFE (Niemeyer & Geiger, 2021) to transform coordinates from the object-centric space to the scene space with the rotation matrix $R_{obj}$ and the translation vector $t_{obj}$:
$$k(x) = R_{obj}\, x + t_{obj}, \tag{10}$$
where $k(x)$ is the transformed coordinate and $t_{obj}$ is the object location in the scene space. We generate the holistic image by alpha composition of the rendered foreground object image $I_{fg}$ and the rendered background image $I_{bg}$ (Eq. 11). The foreground object mask is obtained according to the accumulated density. For the overlapping areas between objects, we reason about the occlusion relations by combining the 3D spatial locations of objects and depth values. Specifically, for every pixel in the rendered image, the object closest to the camera occludes other objects as well as the backgrounds.
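The composition rule can be sketched per pixel as follows: among the objects that cover the pixel, the one with the smallest depth is kept and alpha-composited over the background. The data layout (a tuple of color, mask value, and depth per object) and the function name `compose_pixel` are assumptions for illustration only.

```python
def compose_pixel(objects, background_rgb):
    """Compose one pixel from several rendered objects and a background.

    objects: list of (rgb, alpha, depth) tuples, one per object at this pixel,
             where alpha is the object's mask value from accumulated density
    background_rgb: background color at this pixel
    """
    visible = [obj for obj in objects if obj[1] > 0.0]
    if not visible:
        return list(background_rgb)
    # The object closest to the camera occludes the others and the background
    rgb, alpha, _ = min(visible, key=lambda obj: obj[2])
    return [alpha * f + (1.0 - alpha) * b for f, b in zip(rgb, background_rgb)]


red_near = ([1.0, 0.0, 0.0], 1.0, 2.0)
green_far = ([0.0, 1.0, 0.0], 1.0, 5.0)
pixel = compose_pixel([red_near, green_far], [0.0, 0.0, 1.0])
```

Here the nearer red object wins over the farther green one, and pixels covered by no object fall back to the background color.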

Multi-object Datasets.
For multi-object scenes, existing datasets, e.g., CLEVR (Johnson et al., 2017), multi-dSprites (Matthey et al., 2017), Object Room (Burgess & Kim, 2018), Tetrominoes (Kabra et al., 2019), and CATER (Girdhar & Ramanan, 2019), typically contain objects with the simplest geometric shapes and plain backgrounds. Taking the representative dataset CLEVR (Johnson et al., 2017) as an example, the scene contains 3 kinds of objects, i.e., cube, sphere, and cylinder, all of which are geometric primitives with standard and symmetrical geometries. In this paper, we conduct experiments on a more complex and realistic dataset, Room-Chair (Yu et al., 2022b), which contains indoor scenes with chairs, walls, and floors. Specifically, we adopt the script of Yu et al. (2022b) to render 32,000 images at a resolution of $256^2$. To render chairs with diverse shapes, we choose 649 chair models from the ShapeNet (Chang et al., 2015) library. For the backgrounds, we use 50 types of floors with different textures and materials, e.g., wooden floors. Each image contains a random number of chairs, up to a maximum of 4. Besides, we also render the instance masks and obtain the object bounding boxes as annotations. It is worth noting that the geometry of chairs is much more complex than that of other objects in ShapeNet (Chang et al., 2015), such as cars and bowls, because chairs have many thin and fine structures such as backrests and legs. To our knowledge, Room-Chair is the most challenging multi-object dataset we can find.

Training Details
We use a progressive growing convolutional discriminator $D_\phi$ to compare the fake image produced by the generator $G_\theta$ with the real image $I$ sampled from the training data with distribution $p_D$. For single-object generation, we train MVCGAN using a non-saturating GAN objective with $R_1$ gradient penalty (Mescheder et al., 2018) and the proposed geometry-constrained objective $L_{re}$ as the total loss (Eq. 12), in which $f(t) = -\log(1 + \exp(-t))$, $L_{re} = L_{ir}$ for Stage I (see Eq. 5), $L_{re} = L_{fr}$ for Stage II (see Eq. 8), and $\lambda = 10$. We employ the Adam optimizer (Kingma & Ba, 2015) with $\beta_1 = 0$, $\beta_2 = 0.9$, and a batch size of 56. The initial learning rate is set to $6.0 \times 10^{-5}$ for the generator and $2.0 \times 10^{-4}$ for the discriminator, and decays over training to $1.5 \times 10^{-5}$ and $5.0 \times 10^{-5}$, respectively.
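For reference, the function $f(t) = -\log(1 + \exp(-t))$ and the adversarial part of a non-saturating GAN objective can be written as in the sketch below. The $R_1$ gradient penalty and the geometry term $L_{re}$ are intentionally left out, since the former needs automatic differentiation and the relative weighting of the latter is not stated in the text above. The function names and the sign convention (both losses are quantities to minimize) are assumptions.

```python
import math


def f(t):
    """f(t) = -log(1 + exp(-t)), evaluated in a numerically stable way."""
    return -(max(-t, 0.0) + math.log1p(math.exp(-abs(t))))


def generator_loss(fake_logits):
    """Non-saturating generator loss: mean of -f(D(G(z)))."""
    return -sum(f(t) for t in fake_logits) / len(fake_logits)


def discriminator_loss(real_logits, fake_logits):
    """Discriminator loss: mean of -f(D(I)) plus mean of -f(-D(G(z)))."""
    real_term = -sum(f(t) for t in real_logits) / len(real_logits)
    fake_term = -sum(f(-t) for t in fake_logits) / len(fake_logits)
    return real_term + fake_term


# An undecided discriminator (logit 0) gives log(2) per term
g = generator_loss([0.0])
d = discriminator_loss([0.0], [0.0])
```

The stable form avoids overflow for large negative logits, where a direct evaluation of `exp(-t)` would fail.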
For multi-object generation, we train the object generator $G_{obj}$ and the background generator $G_{bg}$ using the same Adam optimizer, learning rates, and batch size as for single-object generation. The main difference between multi-object and single-object generation is that we sample camera poses from different distributions due to the different scenes of the training datasets (please refer to Sect. 1 of the Appendix for the specific camera pose distribution of each dataset).

Comparison with SOTA
For quantitative comparison, we report the Frechet Inception Distance (FID) (Heusel et al., 2017) to evaluate image quality. We compare our approach against six state-of-the-art 3D-aware image synthesis methods: GRAF, pi-GAN (Chan et al., 2021), GOF, ShadeGAN, GIRAFFE (Niemeyer & Geiger, 2021), and GRAM (Deng et al., 2022b). As shown in Table 1, our method consistently outperforms other methods (Niemeyer & Geiger, 2021; Chan et al., 2021; Xu et al., 2021; Pan et al., 2021) on both single-object and multi-object datasets (Karras et al., 2018, 2019; Choi et al., 2020; Skorokhodov et al., 2022) by a large margin. In particular, on the Room-Chair dataset (Yu et al., 2022b), we observe that most methods cannot handle the multi-object scenarios and fail to learn an appropriate generative model for the scene. In contrast, the extension MVCGAN+ can effectively render the compositional scenes with disentangled objects and backgrounds, outperforming GIRAFFE by a clear margin. To further demonstrate the effectiveness of the proposed method, we also visualize the generated images on single-object and multi-object datasets for qualitative comparison. As illustrated in Figs. 10 and 11, we render images from a wide range of viewpoints. On single-object datasets, we observe that GRAF, GIRAFFE (Niemeyer & Geiger, 2021) and pi-GAN (Chan et al., 2021) either fail to synthesize reasonable results under large view variations or show obvious multi-view inconsistent artifacts. For multi-object scenarios, we note that GIRAFFE (Niemeyer & Geiger, 2021) suffers from collapsed results when the viewpoint changes. By comparison, our method achieves the best performance in both visual quality and multi-view consistency. Please refer to the appendix and the supplementary material for more visualization results.

Ablation Studies
Image-level and Feature-level Optimization. We conduct ablation studies to understand the individual contributions of image-level and feature-level multi-view joint optimization. From Fig. 12a, we observe that the generated images maintain multi-view consistency under pose variations (FID = 22.5), indicating that image-level optimization can guide the model to learn a reasonable 3D shape. With feature-level optimization (see Fig. 12b), our approach further improves the visual quality of the generated images (FID = 13.7). As shown in Fig. 12, the images generated with feature-level optimization have more delicate details, such as clear wrinkles, the highlight on the forehead, and the shadow on the cheeks.
Multi-view Consistency. On the human face dataset (Karras et al., 2019), we take inspiration from Lin et al. (2022) and adopt the face identity preservation score (Face-IDS) to evaluate the multi-view consistency of generated images. In portrait image animation and attribute-editing tasks (Wu et al., 2022; Deng et al., 2020), Face-IDS reflects how well the identity is preserved in the target image compared to the source image. Here we use Face-IDS to evaluate multi-view consistency by measuring the similarity between different views. We first generate 1000 faces and render each face from two random camera poses. Then, for each image pair of the same generated face, we calculate the cosine similarity of the embeddings predicted by a pretrained ArcFace network (Deng et al., 2019). The ArcFace similarity score takes values between -1 and 1 (a greater value means more similar; see Fig. 9 for examples).
Finally, we compute the mean score over the 1000 faces as the face identity preservation score (Face-IDS); on FFHQ, we calculate this score on generated images at 256² resolution. As shown in Table 2, our method achieves the best face identity preservation score (multi-view consistency).
We further conduct experiments to study whether increasing the number of auxiliary poses improves multi-view consistency. From Table 2, we observe that using more auxiliary poses degrades performance: Face-IDS decreases to 0.58 and 0.51 for 2 and 3 auxiliary poses, respectively. We suspect that the performance drop has two causes. First, since both the primary and auxiliary poses are randomly sampled from the camera pose distribution, sampling more poses does not bring a performance gain. Second, increasing the number of auxiliary poses consumes much more GPU memory, because the model has to perform volume rendering several times per iteration. Consequently, we need to reduce the batch size to 8 and increase the weight of the R1 gradient penalty (Mescheder et al., 2018) (λ in Eq. 12). However, the decreased batch size affects training stability and makes the model hard to converge, while the increased R1 regularization weight can hurt the overall performance (Karras et al., 2021; Mescheder et al., 2018).
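The Face-IDS computation described above (cosine similarity per pair of views, then the mean over all faces) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `cosine_similarity` and `face_ids` are hypothetical, and the tiny 2-D vectors stand in for real ArcFace embeddings, which would come from a pretrained network that is not included here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def face_ids(embedding_pairs):
    """Mean cosine similarity over (view_1, view_2) embedding pairs."""
    scores = [cosine_similarity(a, b) for a, b in embedding_pairs]
    if not scores:
        raise ValueError("need at least one embedding pair")
    return sum(scores) / len(scores)


# Hypothetical 2-D embeddings standing in for ArcFace outputs
# (one pair per generated face, one embedding per rendered view).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction: similarity 1
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal: similarity 0
    ([3.0, 4.0], [6.0, 8.0]),  # same direction, different norm: similarity 1
]
score = face_ids(pairs)
print(f"Face-IDS over {len(pairs)} pairs: {score:.3f}")
```

Because cosine similarity ignores vector length, the score only reflects the direction of the embeddings, which is consistent with the stated range of -1 to 1.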

Markov Random Fields Loss.
Previous work found that the ID-MRF loss captures high-frequency details better than the L1 loss in 3D face reconstruction (Feng et al., 2021) and image reconstruction (Wang et al., 2018). Therefore, we adopt the Implicit Diversified Markov Random Field (ID-MRF) loss (Wang et al., 2018) to enforce geometry consistency between views. We also conduct experiments to compare the effects of the ID-MRF and L1 losses on multi-view consistency. Since there is no ground truth for the generated images, we adopt the face identity preservation score (Face-IDS) as the quantitative metric of multi-view consistency.
When using the vanilla L1 loss, we observe that the model achieves multi-view consistency (Face-IDS = 0.61) similar to that of the ID-MRF loss (Face-IDS = 0.62). By this metric, the ID-MRF loss has no obvious advantage over the L1 loss. We suspect that the problem lies in the quantitative metric of multi-view consistency, because the face identity preservation score may not capture high-frequency details such as the wrinkles visualized in Fig. 9 of Feng et al. (2021). As mentioned in the Multi-view Consistency paragraph, we compute Face-IDS with the ArcFace cosine similarity (Deng et al., 2019), but the extracted embedding may lose high-frequency, fine-grained details due to the pooling operation of the ArcFace network (Deng et al., 2019). Therefore, the simple L1 loss can obtain a face identity preservation score similar to that of the ID-MRF loss.
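The suspicion above, that a pooled embedding can be blind to high-frequency differences, can be illustrated with a toy example. This is only a sketch under strong assumptions: the "embedding" here is plain non-overlapping average pooling over a 1-D signal, which is not the ArcFace network, and the alternating pattern is a made-up stand-in for fine detail such as wrinkles.

```python
import math


def average_pool(signal, window=2):
    """Non-overlapping average pooling over a 1-D list."""
    return [
        sum(signal[i:i + window]) / window
        for i in range(0, len(signal) - window + 1, window)
    ]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_abs_diff(a, b):
    """Mean absolute (L1) difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


smooth = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
# Add an alternating +0.5 / -0.5 pattern as toy high-frequency detail.
detailed = [x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(smooth)]

raw_l1 = mean_abs_diff(smooth, detailed)
pooled_cos = cosine_similarity(average_pool(smooth), average_pool(detailed))

print(f"raw L1 difference:        {raw_l1:.3f}")
print(f"pooled cosine similarity: {pooled_cos:.3f}")
```

In this toy setting the two signals differ in every sample, yet their pooled versions are identical, so a similarity score computed after pooling cannot tell them apart.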

Decompose and Compose.
The "decompose and compose" paradigm is essential in compositional image generation. If we directly generate multiple objects using the single-object method and then compose them into a whole scene without the decomposition phase, the generated images have poor object quality and cannot disentangle objects and backgrounds (see Fig. 13). This problem mainly comes from the discriminator, which plays a critical role in the training process of GANs. Without the decomposition phase, we need to perform adversarial training with a scene-level discriminator between the rendered scenes and real images. In this case, the model pays more attention to the global coherence of the whole scene and neglects the supervision of individual objects. For a single object in the scene, the scene-level discriminator can provide only weak learning signals on the object radiance field, because the object region occupies only a small proportion of the whole image. The inadequate training of the single object can lead to degraded object quality. More importantly, the scene-level discriminator cannot disentangle objects and backgrounds, making the background generator easily overfit the whole scene. In contrast, we deploy the decomposition phase to train the object branch and the background branch individually. On the one hand, using two discriminators (the object discriminator D obj and the background discriminator D bg ) provides sufficient supervision for objects, leading to better quality of the generated objects. On the other hand, the disentanglement of objects and backgrounds allows us to control them separately, such as moving and rotating each object or the background.
Scene Decomposition. We also investigate the disentanglement of foregrounds and backgrounds in MVCGAN+. As shown in Fig. 14, our method can decompose foreground objects and backgrounds from the holistic scene. The disentanglement allows us to control each object and the background individually. We can perform scene editing such as adding, moving, deleting, rotating, and changing individual objects or backgrounds. Please refer to the supplementary video for more visualization results.
3D Representation. To better illustrate the learned 3D representation, we visualize the underlying 3D shape of generated images with 3D reconstruction methods (Schonberger & Frahm, 2016; Lorensen & Cline, 1987). For single-view 3D reconstruction, we adopt the marching cubes algorithm (Lorensen & Cline, 1987) to extract the underlying geometry of the generated image (see Fig. 15 for the visualized 3D meshes). To further demonstrate the multi-view consistency of our method, we also perform multi-view 3D reconstruction (Schonberger & Frahm, 2016) to recover the 3D shape from generated multi-view images. Specifically, we first render images of a single instance from 35 views, and then perform dense 3D reconstruction by running COLMAP (Schonberger & Frahm, 2016) with default parameters and no known camera poses. The results in Figs. 15 and 16 validate the correctness of the 3D shape learned by our model.
Fig. 16 Visualization of the COLMAP reconstruction (Schonberger & Frahm, 2016) from synthesized multi-view images
Style Interpolation. We also conduct style interpolation experiments to investigate the intermediate latent w learned by the mapping network G m . Given two generated images, we perform linear interpolation in both the intermediate latent space W and the camera pose space. As illustrated in Fig. 17, the smooth transition of both pose and appearance demonstrates that our model learns a semantically meaningful intermediate latent space W.
Shape-detail Disentanglement. In addition, we design a style mixing experiment to study what kinds of representations the generative radiance field G s and the progressive 2D decoder G d learn, respectively.
Specifically, we input two latent codes z A and z B into the mapping network G m , and obtain the corresponding intermediate latents w A and w B in W space. Then we generate style mixing images by applying w A and w B to control different parts of the generator (G s and G d ). As shown in Fig. 18, we observe that controlling G s changes the 3D shape (identity and pose), while controlling G d changes 2D appearance details (colors of skin, hair, and beard). The results verify that the hybrid MLP-CNN architecture can disentangle the geometry of the 3D shape from the fine details of the 2D appearance.
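The style interpolation experiment described above interpolates linearly in both the intermediate latent space W and the camera pose space. A minimal sketch of that joint interpolation is given below. The names `lerp` and `interpolation_path` are hypothetical, the latents are shortened to two dimensions, and the pose is assumed to be a (yaw, pitch) pair in radians with made-up values; the generator that would consume each step is not shown.

```python
def lerp(a, b, t):
    """Linear interpolation between two equal-length vectors, t in [0, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def interpolation_path(w_a, w_b, pose_a, pose_b, steps):
    """Return `steps` (latent, pose) pairs moving from A to B in lockstep."""
    if steps < 2:
        raise ValueError("need at least two steps to include both endpoints")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at the first step, 1.0 at the last
        path.append((lerp(w_a, w_b, t), lerp(pose_a, pose_b, t)))
    return path


# Toy intermediate latents and (yaw, pitch) poses; all values are made up.
w_a, w_b = [0.0, 2.0], [4.0, 0.0]
pose_a, pose_b = [1.2, 1.5], [1.8, 1.7]

path = interpolation_path(w_a, w_b, pose_a, pose_b, steps=5)
for w, pose in path:
    print(w, pose)
```

Each (latent, pose) pair would be fed to the generator to render one frame of the transition; using the same t for both quantities is what makes pose and appearance change together.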

Conclusion
We present a multi-view consistent generative model (MVCGAN) for compositional 3D-aware image synthesis. The key idea underpinning the proposed method is to enhance the geometric reasoning ability of the generative model by introducing geometry constraints. Besides, we adapt MVCGAN to more complex and multi-object scenes. Extensive experiments on single-object and multi-object datasets demonstrate that the proposed method achieves state-of-the-art performance for 3D-aware image synthesis.

Funding Open Access funding enabled and organized by CAUL and its Member Institutions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Fig. 19
The images are rendered from 35 camera poses at resolution 256².
CELEBA-HQ. CELEBA-HQ (Karras et al., 2018) consists of 30,000 high-quality images of human faces at 1024² resolution. During training, we sample the pitch and yaw of the camera pose from a Gaussian distribution with a horizontal standard deviation of 0.3 radians and a vertical standard deviation of 0.155 radians.
FFHQ. Flickr-Faces-HQ (FFHQ) (Karras et al., 2019) is a large-scale human face dataset that contains 70,000 high-quality images at 1024² resolution. The images cover various styles with different ages, ethnicities, and backgrounds. In addition, the people in the images wear different accessories such as earrings, sunglasses, hats, and eyeglasses. In the training stage, we sample the pitch and yaw of the camera pose from a Gaussian distribution with a horizontal standard deviation of 0.3 radians and a vertical standard deviation of 0.155 radians.
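The Gaussian pose sampling used for the face datasets above can be sketched as follows. Only the two standard deviations (0.3 and 0.155 radians) come from the text. Everything else is an assumption made for illustration: that "horizontal" refers to yaw and "vertical" to pitch, that both means are π/2, that the camera sits on a sphere around the object, and the particular spherical-coordinate convention in `camera_origin`. The function names are hypothetical.

```python
import math
import random


def sample_pose(rng, h_std=0.3, v_std=0.155,
                h_mean=math.pi / 2, v_mean=math.pi / 2):
    """Sample (yaw, pitch) in radians from independent Gaussians.

    h_std and v_std are the values quoted in the text; the means are assumed.
    """
    yaw = rng.gauss(h_mean, h_std)
    pitch = rng.gauss(v_mean, v_std)
    return yaw, pitch


def camera_origin(yaw, pitch, radius=1.0):
    """Place the camera on a sphere (assumed convention: y is the up axis)."""
    x = radius * math.sin(pitch) * math.cos(yaw)
    y = radius * math.cos(pitch)
    z = radius * math.sin(pitch) * math.sin(yaw)
    return x, y, z


rng = random.Random(0)  # fixed seed so the sketch is reproducible
poses = [sample_pose(rng) for _ in range(20000)]

yaw, pitch = poses[0]
print("first pose (yaw, pitch):", yaw, pitch)
print("camera origin:", camera_origin(yaw, pitch))
```

The narrower vertical spread reflects the stated distribution: the sampled viewpoints vary more left to right than up and down.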
AFHQv2. Animal Faces-HQ (AFHQv2) (Choi et al., 2020) contains 15,000 high-quality animal face images at 512² resolution. The dataset has three categories: cat, dog, and wildlife, with each category providing 5,000 images. Following previous works (Niemeyer & Geiger, 2021; Chan et al., 2021), we conduct experiments on the cat face images to make a fair comparison. During training, we sample the pitch and yaw of the camera pose from a uniform distribution with the horizontal standard deviation of 0.4 radians and the vertical standard deviation of 0.2 radians.
[...] (Skorokhodov et al., 2022) consists of 141,824 plant images rendered from 1,108 models at 256² resolution. During training, we sample the pitch and yaw of the camera pose from a uniform distribution with the horizontal standard deviation of 2π radians and the vertical standard deviation of π radians.
Room-Chair. Room-Chair (Yu et al., 2022b) is a multi-object indoor scene dataset with a random number of chairs and various backgrounds. We follow the script (Yu et al., 2022b) to

B Additional Results
We provide additional results to show the multi-view consistency and the quality of the generated images.
As shown in Fig. 19, we render images of a single instance from 35 views. We also provide more generated images in Figs. 20, 21, 22, 23, and 24. Please also refer to the supplementary video (https://youtu.be/k207rGznpEk) for more results.
Fig. 23 Images synthesized by MVCGAN on M-Food (Skorokhodov et al., 2022) at resolution 256²

Fig. 24
Images synthesized by MVCGAN+ on Room-Chair (Yu et al., 2022b) at resolution 256²
1. The fundamental difference is that our method represents the scene in 3D space, while StyleGAN3 (Karras et al., 2021) operates in the 2D domain. To generate an image, we first query the 3D representation of the scene (neural radiance fields), and then use volume rendering to synthesize the image from a specific viewpoint. Every image generated by the proposed model has an underlying 3D representation. Therefore, we can extract the underlying geometry of the generated image and export it as meshes or point clouds (see Figs. 15 and 16).
2. Our method is more controllable. Our method explicitly disentangles the camera pose from the latent code, while StyleGAN3 encodes both the camera pose and the identity into the latent code. Therefore, we can generate images of the same identity from different views, or generate different identities from the same viewpoint. In addition, the proposed method supports other camera operations, e.g., rotation, translation, zoom-in, and zoom-out (see the supplementary video). In contrast, the random latent walk of StyleGAN3 (Karras et al., 2021) is arbitrary and uncontrollable. Since the identity and the camera pose are coupled in the latent code, changing the latent code can change both the camera pose and the identity. As shown in Video 1a and Video 1b on the project page of StyleGAN3 (Karras et al., 2021), we observe that the mouth and expression also change across views.