1 Introduction

We study the problem of compositional 3D-aware image synthesis, which aims to generate images with explicit control over the camera pose and individual objects. Unlike 2D generative adversarial networks (Brock et al., 2018; Zhu et al., 2017; Choi et al., 2018; Karras et al., 2018; Huang & Belongie, 2017; Karras et al., 2019; Zheng et al., 2019; Karras et al., 2020; Choi et al., 2020), 3D-aware image synthesis models learn 3D scene representations from images, such as voxels (Nguyen-Phuoc et al., 2019, 2020), intermediate 3D primitives (Liao et al., 2020), and neural radiance fields (NeRF) (Schwarz et al., 2020; Chan et al., 2021; Niemeyer & Geiger, 2021; DeVries et al., 2021). Among these, NeRF-based approaches (Schwarz et al., 2020; Chan et al., 2021; Niemeyer & Geiger, 2021; DeVries et al., 2021; Deng et al., 2022b; Gu et al., 2022) have attracted a surge of interest due to their superior performance in high-fidelity view synthesis. However, two key challenges remain. (1) Existing approaches (Schwarz et al., 2020; Chan et al., 2021; Niemeyer & Geiger, 2021) do not enforce geometry constraints between views and hence often fail to generate multi-view-consistent images. (2) As pointed out by Schwarz et al. (2020), current methods do not work well on scenes that contain multiple objects with complex shapes and diverse backgrounds.

In this paper, we address the first problem by proposing MVCGAN, a multi-view consistent generative model with geometry constraints (see Fig. 1).

Fig. 1

Images synthesized by MVCGAN on the CELEBA-HQ (Karras et al., 2018) dataset. We render multi-view images at resolution \(512^2\) from different viewpoints

Fig. 2

Typical failure cases. Taking the representative method GIRAFFE (Niemeyer & Geiger, 2021) as an example, the generated images in the first row show obvious appearance inconsistencies between views, such as the direction of the hair and the open mouth (highlighted boxes). Besides, we notice that GIRAFFE (Niemeyer & Geiger, 2021) suffers from collapsed results under large pose variations (see the leftmost and rightmost pictures in the first row), indicating that the model does not learn an appropriate 3D shape. In contrast, our method generates high-quality images with multi-view consistency (see the second row) (Color figure online)

Here we present typical failure cases of an existing approach (Niemeyer & Geiger, 2021) in Fig. 2. We identify the cause of this cross-view inconsistency: previous methods optimize each view of the generated image independently while ignoring the geometry constraints between views (see Sect. 3.2.1). To tackle this problem, the proposed MVCGAN takes inspiration from classical multi-view geometry methods (Zhou et al., 2017; Godard et al., 2019) to build geometry constraints across views. Specifically, we perform multi-view joint optimization by enforcing photometric consistency between pairs of views with a re-projection loss and by integrating a stereo mixup mechanism into the training process. Therefore, the generator not only learns the manifold of 2D images but also ensures the geometric correctness of the underlying 3D shape. Besides, we notice that NeRF-based generative approaches (Schwarz et al., 2020; Chan et al., 2021; Niemeyer & Geiger, 2021) typically struggle to render high-resolution images with fine details due to the huge computational complexity of the NeRF model (Mildenhall et al., 2020). Therefore, we adopt a hybrid MLP-CNN architecture, which contains one MLP-based NeRF model and one CNN-based decoder. Specifically, the MLP-based NeRF model (Mildenhall et al., 2020) renders the geometry of the 3D shape, and the subsequent CNN-based decoder produces fine details for the 2D appearance. This structure can generate photorealistic high-resolution images while alleviating the computation-intensive problem.

We further adapt MVCGAN to multi-object scenes with backgrounds via a compositional framework, MVCGAN+. Specifically, MVCGAN+ employs two MVCGAN branches to model the foreground objects and backgrounds separately. Besides, we propose a “decompose and compose” scheme to perform complex scene generation in a top-down and bottom-up manner. During training, we explicitly incorporate object masks to decompose the objects and backgrounds from the training images. The disentanglement of objects and backgrounds allows us to impose geometry constraints on the foreground objects and the background separately. When rendering the whole scene, we compose the objects and backgrounds via object masks and occlusion relations. Our main contributions are summarized as follows:

  1.

    We identify the cause of the multi-view inconsistency in 3D-aware image synthesis, and propose to incorporate geometry constraints into the generative radiance field for single-object scene generation.

  2.

    To tackle complex multi-object scenes, we further scale MVCGAN to a compositional framework that operates in a top-down and bottom-up manner. To our knowledge, we are among the early attempts to incorporate instance masks into generative radiance fields to tackle complex multi-object scenarios.

  3.

    We demonstrate the effectiveness and scalability of the proposed approach by evaluating it on both single-object and multi-object datasets. Extensive experiments substantiate that our method achieves competitive performance for 3D-aware image synthesis.

This paper is an extension of our previous conference version (Zhang et al., 2022). Compared to the preliminary version, this work includes the following new content. (1) Owing to the inadequate exploration of complex multi-object scenes in current works, we scale MVCGAN (Zhang et al., 2022) to a compositional framework, MVCGAN+, for multi-object 3D-aware image generation. In particular, we model the foreground objects and backgrounds with two separate branches. (2) By incorporating easily obtained 2D annotations, i.e., instance masks and bounding boxes, we formulate multi-object image generation as a “decompose and compose” process. To our knowledge, this is among the first attempts to incorporate instance masks into generative radiance fields to tackle multi-object generation problems. (3) To further validate the competence of our method, we add more experiments and discussions for ablation studies and visualization results.

2 Related Work

2.1 Multi-view Geometry

A large number of approaches reconstruct 3D structures with multi-view geometry constraints as supervision signals, such as COLMAP (Schonberger & Frahm, 2016) and ORB-SLAM (Mur-Artal et al., 2015). In recent years, some deep learning techniques (Zhou et al., 2017; Godard et al., 2019; Yao et al., 2018) also combine traditional approaches (Chen & Williams, 1993; Collins, 1996; Szeliski & Golland, 1999) to address 3D vision problems. Inspired by the classical multi-view geometry methods (Chen & Williams, 1993; Debevec et al., 1996; Andrew, 2001; Seitz & Dyer, 1996; Zhou et al., 2017; Godard et al., 2019), we explicitly involve the geometry constraints in the training process for learning a reasonable 3D shape.

2.2 Neural Radiance Fields

Recently, using volumetric rendering and implicit function to synthesize novel views of a scene has gained a surge of interest. Mildenhall et al. (2020) represent complex scenes as Neural Radiance Fields (NeRF) for novel view synthesis by optimizing an implicit continuous volumetric scene function. Due to the simplicity and extraordinary performance, NeRF (Mildenhall et al., 2020) has been extended to plenty of variants, e.g., faster training (Yu et al., 2022a), faster inference (Yu et al., 2021a; Reiser et al., 2021; Garbin et al., 2021; Rebain et al., 2021; Lindell et al., 2021), pose estimation (Yen-Chen et al., 2021; Lin et al., 2021; Jeong et al., 2021; Meng et al., 2021; Wang et al., 2021), generalization (Chibane et al., 2021; Chen et al., 2021; Yu et al., 2021b; Trevithick & Yang, 2021; Liu et al., 2022), video (Xian et al., 2021; Dynamic view synthesis, 2021; Li et al., 2021, 2021a; Peng et al., 2021), and depth estimation (Wei et al., 2021).

2.3 3D-Aware Image Synthesis

Generating photorealistic and editable image content is a long-standing problem in computer vision and graphics. In the past years, generative adversarial networks (GAN) (Goodfellow et al., 2020) have demonstrated impressive results in synthesizing high-quality, high-resolution images from unstructured image collections (Zhu et al., 2017; Brock et al., 2018; Choi et al., 2018; Karras et al., 2018; Huang & Belongie, 2017; Karras et al., 2019; Zheng et al., 2019; Karras et al., 2020; Choi et al., 2020). Despite the tremendous success, most of these methods only learn the manifold of 2D images while ignoring the 3D representation of the scene. In recent years, several works have investigated how to incorporate 3D representations into generative models (Alhaija et al., 2018; Nguyen-Phuoc et al., 2019; Zhu et al., 2018; Liao et al., 2020; Nguyen-Phuoc et al., 2020; Henderson et al., 2020; DeVries et al., 2021). Nguyen-Phuoc et al. (2019) combine a strong inductive bias about the 3D world with deep generative models to learn disentangled representations of 3D objects, providing control over the pose of generated objects through rigid-body transformations of the learned 3D features. Schwarz et al. (2020) propose GRAF, generative radiance fields for 3D-aware image synthesis from unposed 2D images. pi-GAN (Chan et al., 2021) adopts a SIREN-based neural implicit representation with periodic activation functions as the backbone of the generator. GIRAFFE (Niemeyer & Geiger, 2021) represents scenes as compositional generative neural feature fields. ShadeGAN (Pan et al., 2021) models the illumination to regularize the training process. Combining the occupancy representation with radiance fields, Xu et al. (2021) introduce Generative Occupancy Fields (GOF) to shrink the sample region of the volume rendering process. StyleNeRF (Gu et al., 2022) integrates NeRF (Mildenhall et al., 2020) into a StyleGAN-like generator (Karras et al., 2019, 2020) to close the gap between 2D and 3D GANs. Zhou et al. (2021) extend CIPS (Anokhin et al., 2021) to CIPS-3D, a 3D-aware generator composed of a NeRF and an implicit neural representation network. StyleSDF (Or-El et al., 2022) achieves high-resolution image generation and 3D surface modeling by integrating an SDF-based 3D representation into the 2D style-based generative model (Karras et al., 2019, 2020). Recently, Chan et al. (2022) introduce a novel tri-plane representation with 3D inductive bias, resulting in a more efficient and expressive 3D GAN framework, EG3D. VolumeGAN (Xu et al., 2022) learns structural and textural representations with a 3D feature volume and a neural renderer, respectively. Deng et al. (2022b) reduce the number of sampling points by learning generative 2D manifolds (GRAM), while GRAM-HD (Xiang et al., 2022) achieves better results by performing super-resolution in the 3D space. VoxGRAF (Schwarz et al., 2022) explores sparse voxel grid representations to accelerate training. Skorokhodov et al. (2022) redesign the patch-based discriminator to improve the optimization scheme of 3D generative adversarial networks. However, these methods typically optimize a single view of the generated scene independently and ignore the underlying geometry constraints across views.

3 Methodology

Our goal is to generate photorealistic high-resolution images with explicit control over the camera pose while maintaining multi-view consistency. We now present the main components of the proposed method. First, we briefly review the background of NeRF-based generative adversarial networks (Schwarz et al., 2020; Niemeyer & Geiger, 2021; Chan et al., 2021) and identify the limitations of previous methods (see Sect. 3.1). Second, we analyze the cause of the multi-view inconsistency problem and present Multi-View Consistent Generative Adversarial Networks (MVCGAN) for single-object generation (see Sect. 3.2 and Fig. 5 for an overview). Finally, based on MVCGAN, we further introduce a compositional framework (MVCGAN+) for multi-object image generation in Sect. 3.3.

3.1 Preliminaries

Neural Radiance Fields. Neural radiance field (NeRF) synthesizes novel views of the scene by optimizing a fully-connected network using a set of input views. The MLP network maps a continuous 5D coordinate (3D location \({\textbf {x}}\) and 2D viewing direction \({\textbf {d}}\)) to an emitted color \({\textbf {c}}\) and volume density \(\sigma \) (Mildenhall et al., 2020):

$$\begin{aligned} (\gamma ({\textbf {x}}), \gamma ({\textbf {d}})) \xrightarrow []{} ({\textbf {c}}, \sigma ), \end{aligned}$$
(1)

where \(\gamma \) indicates the positional encoding mapping function. To render the neural radiance field from a viewpoint, Mildenhall et al. (2020) use classic volume rendering to accumulate the output colors \({\textbf {c}}\) and densities \(\sigma \) into an image.
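As a concrete illustration, the following minimal PyTorch-style sketch shows the positional encoding \(\gamma \) and the classic volume-rendering accumulation described above; the function names and tensor shapes are our own choices for exposition, not the released implementation.

```python
import math
import torch

def positional_encoding(x, num_freqs=10):
    # gamma(.): map each 3D coordinate to [sin(2^k * pi * x), cos(2^k * pi * x)] features.
    freqs = (2.0 ** torch.arange(num_freqs)) * math.pi        # (L,)
    angles = x[..., None] * freqs                             # (..., 3, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                          # (..., 6L)

def composite_ray(colors, sigmas, dists):
    # Classic volume rendering: accumulate per-sample colors c_i and
    # densities sigma_i along one ray into a single pixel color.
    deltas = dists[1:] - dists[:-1]
    deltas = torch.cat([deltas, deltas[-1:]])                 # pad last interval
    alphas = 1.0 - torch.exp(-sigmas * deltas)                # (N,)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=0)
    trans = torch.cat([torch.ones(1), trans[:-1]])            # transmittance T_i
    weights = trans * alphas
    return (weights[:, None] * colors).sum(dim=0)             # (3,)
```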

Generative Radiance Fields. Generative neural radiance fields aim to learn a model for synthesizing novel scenes by training on unposed 2D images. Schwarz et al. (2020) adopt an adversarial framework to train a generative model for radiance fields (GRAF). The generative radiance field is conditioned on a shape code \(z_s\) and an appearance code \(z_a\):

$$\begin{aligned} (\gamma ({\textbf {x}}), \gamma ({\textbf {d}}), z_s, z_a) \xrightarrow []{} ({\textbf {c}}, \sigma ). \end{aligned}$$
(2)

Following GRAF (Schwarz et al., 2020), Niemeyer and Geiger (2021) introduce a compositional generative neural feature field (GIRAFFE). Inspired by StyleGAN (Karras et al., 2019), Chan et al. (2021) instead propose periodic implicit generative adversarial networks (pi-GAN) with feature-wise linear modulation (FiLM) conditioning.
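To make the FiLM-style conditioning concrete, the sketch below shows one SIREN layer whose sine activation is modulated by frequencies and phase shifts predicted from the latent, in the spirit of pi-GAN (Chan et al., 2021); the class name, layer widths, and tensor shapes are illustrative assumptions rather than the original code.

```python
import torch
import torch.nn as nn

class FiLMSirenLayer(nn.Module):
    """One SIREN layer with FiLM conditioning: the latent w predicts
    per-channel frequencies and phase shifts for the sine activation."""
    def __init__(self, in_dim, out_dim, w_dim=256):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.to_freq = nn.Linear(w_dim, out_dim)    # frequencies from w
        self.to_shift = nn.Linear(w_dim, out_dim)   # phase shifts from w

    def forward(self, x, w):
        # x: (B, num_points, in_dim), w: (B, w_dim)
        freq = self.to_freq(w).unsqueeze(1)         # (B, 1, out_dim)
        shift = self.to_shift(w).unsqueeze(1)
        return torch.sin(freq * self.linear(x) + shift)
```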

Limitations. We notice two limitations of existing approaches (Schwarz et al., 2020; Niemeyer & Geiger, 2021; Chan et al., 2021). First, they do not guarantee geometry constraints between different views. Consequently, they usually suffer from collapsed results under large pose variations or produce obvious inconsistent artifacts across views. Second, these approaches mostly cannot tackle scenes that contain multiple objects and complex backgrounds.

Fig. 3

Visualization of shape-radiance ambiguity. For illustration, we assume p is the location of the correct geometry, while \(p_1\) and \(p_2\) are incorrect geometries (marked by the colored dots). In the absence of geometry constraints, the model can fit incorrect geometry \(p_1\) in view 1 and \(p_2\) in view 2 independently to simulate the effect of the correct geometry p (Color figure online)

3.2 MVCGAN for Single-Object Image Generation

3.2.1 Image-Level Multi-view Joint Optimization

Fig. 4

Illustration of the warping process. For each pixel \(v_{pri}\) in the primary image \({\mathcal {I}}_{pri}\), we first calculate the location of \(v_{aux}\) (the corresponding pixel of \(v_{pri}\) in the auxiliary image \({\mathcal {I}}_{aux}\)) based on the depth value \({\mathcal {D}}(v_{pri})\) and the camera transformation matrix \([R, t]\). Then we can reconstruct the pixel \(v_{pri}'\) of the warped image \({\mathcal {I}}_{warp}\) from the primary view using the value of pixel \(v_{aux}\). We observe that the warped image has a wrong appearance, which reveals the incorrect geometry learned by the model

Fig. 5

The generator of MVCGAN. During training, the generative radiance field network \(G_s\) takes the primary pose \(\xi _{pri}\) and the auxiliary pose \(\xi _{aux}\) as input. The mapping network \(G_m\) maps the input latent z to the intermediate latent w, which conditions both the generative radiance field network \(G_s\) and the progressive 2D decoder \(G_d\). In Stage I, we directly render the primary image \({\mathcal {I}}_{pri}\) and the auxiliary image \({\mathcal {I}}_{aux}\) with the color and density output from \(G_s\). Then we perform image-level multi-view joint optimization and output a low-resolution RGB image (\(64^2\)). In Stage II, we instead use volume rendering to accumulate 2D feature maps at low resolution (\(64^2\)), and then perform multi-view optimization at the feature level. The progressive 2D decoder \(G_d\) upsamples the 2D feature map \({\mathcal {F}}_{mix}\) to a high-resolution RGB image (\(128^2\), \(256^2\), \(512^2\)) for fine 2D details. During inference, only the primary pose is required and the auxiliary pose is not needed (the dotted lines do not participate in inference)

Shape-radiance Ambiguity. In this part, we analyze the cause of the multi-view inconsistency problem in NeRF-based generative models. We observe that optimizing the radiance fields from a set of 2D training images can encounter critical degenerate solutions in the absence of geometry constraints. This phenomenon is referred to as shape-radiance ambiguity (Zhang et al., 2020), in which the model can fit the training images with an inaccurate 3D shape by a suitable choice of radiance field at each surface point (see Fig. 3). To better illustrate the shape-radiance ambiguity, we warp the rendered images from view 1 to view 2 based on the underlying depth and the camera transformation matrix \([R, t]\) (see the details of the warping process in Fig. 4 and Eq. 4). We find that the warped image shows a wrong appearance, which confirms that the learned 3D shape is a degenerate solution. To avoid the shape-radiance ambiguity, NeRF (Mildenhall et al., 2020) requires a large number of posed training images from different input views of the scene. However, generative radiance fields have neither annotated camera poses nor sufficient multi-view images in the training dataset. Consequently, the generative model can synthesize reasonable images in some views but produce poor renderings in other views (see Fig. 2).

Warping Process. To alleviate the shape-radiance ambiguity (Zhang et al., 2020), we propose to establish multi-view geometry constraints (Chen & Williams, 1993; Debevec et al., 1996; Andrew, 2001; Seitz & Dyer, 1996; Zhou et al., 2017; Godard et al., 2019) via a warping process between views. First, following pi-GAN (Chan et al., 2021), we adopt a style-based generator that contains a synthesis network \(G_s\) (a SIREN-based (Sitzmann et al., 2020; Chan et al., 2021) generative radiance field) and a mapping network \(G_m\) (a simple MLP network with ReLU) (see Fig. 5). Given a latent code \(z\in {\mathbb {R}}^{256}\) in the input latent space \({\mathcal {Z}}\), the mapping network \(G_m: {\mathcal {Z}} \xrightarrow []{} {\mathcal {W}}\) produces the intermediate latent \(w \in {\mathbb {R}}^{256}\), which controls the synthesis network \(G_s\) at each layer. Second, instead of optimizing only a single view independently, we optimize multiple views jointly to maintain 3D consistency across views. As shown in the left of Fig. 5, we randomly sample two camera poses, i.e., the primary pose \(\xi _{pri}\) and the auxiliary pose \(\xi _{aux}\), from the pose distribution \(p_{\xi }\). Taking \(\xi _{pri}\) and \(\xi _{aux}\) as input, the generative model \(G_s\) synthesizes two views of the generated images separately: the primary image \({\mathcal {I}}_{pri}\) and the auxiliary image \({\mathcal {I}}_{aux}\). Then we can build geometry constraints between \(\xi _{pri}\) and \(\xi _{aux}\) via image warping, which reconstructs the primary view by sampling pixels from the auxiliary image \({\mathcal {I}}_{aux}\). Specifically, for each point \(v_{pri}\) in the primary image \({\mathcal {I}}_{pri}\), we first find the corresponding pixel \(v_{aux}\) in the auxiliary image \({\mathcal {I}}_{aux}\) through the stereo correspondence, and then reconstruct the pixel \(v'_{pri}\) of the warped image \({\mathcal {I}}_{warp}\) from the primary view using the value of \(v_{aux}\) (see Fig. 4). Next, we present the detailed calculation procedure of the warping process. The stereo correspondence is calculated based on the depth map \({\mathcal {D}}\) of the primary image and the camera transformation matrix from \(\xi _{pri}\) to \(\xi _{aux}\). The depth can be rendered in a similar way as rendering the color image (Mildenhall et al., 2020; Deng et al., 2022a). Given the pixel \(v_{pri}\) from the primary view, the depth value \({\mathcal {D}}(v_{pri})\) is formulated as:

$$\begin{aligned} \begin{aligned}&{\mathcal {D}}(v_{pri}) = \sum \limits _{i=1}^{N} T_i(1 - exp(-\sigma _i\delta _i))d_i, \\&T_i= exp(-\sum \limits _{j=1}^{i-1}\sigma _j\delta _j), \end{aligned} \end{aligned}$$
(3)

where N is the number of samples along the camera ray, \(\delta _i = d_{i+1} - d_{i}\) is the distance between adjacent sample points, and \(\sigma _i\) is the volume density of sample i (refer to Mildenhall et al. (2020) and Deng et al. (2022a) for more details). With the depth value \({\mathcal {D}}(v_{pri})\), we can obtain the homogeneous coordinates \(h_{pri}\) of pixel \(v_{pri}\) in the primary camera coordinate system through perspective projection. Then the projected coordinates \(h_{aux}\) in the auxiliary view can be calculated as:

$$\begin{aligned} h_{aux} = K[R, t]{\mathcal {D}}(v_{pri})K^{-1}h_{pri}, \end{aligned}$$
(4)

where the camera intrinsics K are known parameters and the camera transformation matrix \([R, t]\) can be calculated from the primary pose \(\xi _{pri}\) and the auxiliary pose \(\xi _{aux}\). Finally, we can reconstruct the pixel \(v'_{pri}\) in the warped image \({\mathcal {I}}_{warp}\) from the primary view using the value of pixel \(v_{aux}\) (located at \(h_{aux}\) in \({\mathcal {I}}_{aux}\)).
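The sketch below implements the two steps just described under our own simplified interface: Eq. (3) renders the expected depth of a ray from the sampled densities, and Eq. (4) projects a primary-view pixel into the auxiliary view. Bilinearly sampling \({\mathcal {I}}_{aux}\) at the resulting coordinates (e.g., with torch.nn.functional.grid_sample) then yields the warped image; the per-pixel formulation here is only for clarity.

```python
import torch

def render_depth(sigmas, dists):
    # Eq. (3): expected ray termination depth from densities sigma_i and
    # sample distances d_i; deltas are the spacings between adjacent samples.
    deltas = dists[1:] - dists[:-1]
    deltas = torch.cat([deltas, deltas[-1:]])
    alphas = 1.0 - torch.exp(-sigmas * deltas)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=0)
    trans = torch.cat([torch.ones(1), trans[:-1]])            # transmittance T_i
    return (trans * alphas * dists).sum()

def project_to_auxiliary(v_pri, depth, K, R, t):
    # Eq. (4): back-project a primary-view pixel with its depth, transform it
    # by the relative pose [R, t], and re-project it into the auxiliary view.
    h_pri = torch.tensor([v_pri[0], v_pri[1], 1.0])           # homogeneous pixel
    p_cam = depth * (torch.linalg.inv(K) @ h_pri)             # primary camera frame
    h_aux = K @ (R @ p_cam + t)
    return h_aux[:2] / h_aux[2]                               # pixel coords in I_aux
```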

Image-level Joint Optimization. After obtaining the warped image \({\mathcal {I}}_{warp}\), we perform image-level multi-view joint optimization by enforcing the photometric consistency and employing a stereo mixup module (see Fig. 6). To satisfy the geometry constraints between views, we enforce the photometric consistency across views by minimizing the re-projection loss between the primary image \({\mathcal {I}}_{pri}\) and the warped image \({\mathcal {I}}_{warp}\). Following the common practice in image reconstruction  (Wang et al., 2004; Zhao et al., 2016; Pillai et al., 2019; Zhou et al., 2017; Godard et al., 2019; Lyu et al., 2021), we formulate the image-level re-projection loss as the combination of L1 (Zhao et al., 2016) and SSIM (Wang et al., 2004):

$$\begin{aligned} \begin{aligned}&{\mathcal {L}}_{ir} = (1 - \mu )||{\mathcal {I}}_{pri} - {\mathcal {I}}_{warp}||_1 + \\&\frac{\mu }{2} (1 - SSIM({\mathcal {I}}_{pri}, {\mathcal {I}}_{warp})), \end{aligned} \end{aligned}$$
(5)

where SSIM is a perceptual metric of image structural similarity and \(\mu =0.85\) is set empirically. In addition to being similar to the primary image, the warped image should also look like a real image. A straightforward solution is to introduce two discriminators: one compares the warped image \({\mathcal {I}}_{warp}\) with an arbitrary real image sampled from the training dataset, and the other compares the primary image \({\mathcal {I}}_{pri}\). However, introducing extra modules increases the computational complexity. Inspired by the mixup strategy (Zhang et al., 2018), we instead propose a stereo mixup module to optimize both \({\mathcal {I}}_{pri}\) and \({\mathcal {I}}_{warp}\) by constructing a virtual mixed image:

$$\begin{aligned} {\mathcal {I}}_{mix} = \eta {\mathcal {I}}_{pri} + (1 - \eta ){\mathcal {I}}_{warp}, \end{aligned}$$
(6)

where \(\eta \) is randomly sampled from [0, 1] at every training iteration, and \({\mathcal {I}}_{mix}\) is the input of the discriminator. It is worth noting that the auxiliary pose is introduced only to construct the geometry constraints, and is thus required only during training. In the inference stage, the generative model takes only the primary pose \(\xi _{pri}\) and the latent code z as input to generate the primary image directly.
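A minimal sketch of Eqs. (5) and (6) is given below; it assumes images normalized to \([0, 1]\) and an external SSIM implementation (here the pytorch_msssim package), which is our choice of dependency rather than part of the method.

```python
import torch
from pytorch_msssim import ssim   # assumed third-party SSIM implementation

def image_reprojection_loss(img_pri, img_warp, mu=0.85):
    # Eq. (5): weighted combination of L1 and SSIM photometric terms.
    l1 = (img_pri - img_warp).abs().mean()
    ssim_term = 1.0 - ssim(img_pri, img_warp, data_range=1.0)
    return (1.0 - mu) * l1 + 0.5 * mu * ssim_term

def stereo_mixup(img_pri, img_warp):
    # Eq. (6): the discriminator sees a random convex combination of the
    # primary and warped images, so both are optimized without a second D.
    eta = torch.rand(1, device=img_pri.device)
    return eta * img_pri + (1.0 - eta) * img_warp
```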

Fig. 6

Image-level multi-view joint optimization. We enforce the photometric consistency between the primary image \({\mathcal {I}}_{pri}\) and the warped image \({\mathcal {I}}_{warp}\) by minimizing the image-level re-projection loss \({\mathcal {L}}_{ir}\). Besides, we integrate a stereo mixup module to encourage the warped image to be similar to a real image. The dotted line does not participate in the inference stage

3.2.2 Feature-Level Multi-view Joint Optimization

In practice, we also encounter one practical challenge: NeRF-based generative models (Schwarz et al., 2020; Niemeyer & Geiger, 2021; Chan et al., 2021) typically struggle to render high-resolution images with fine details due to the huge computational cost of the NeRF (Mildenhall et al., 2020) model. To render images with both fine 2D details and a correct 3D shape, we design a two-stage training strategy and extend the multi-view optimization to the feature level. We begin training at a low resolution (\(64^2\)) in Stage I, and then increase to high resolutions (\(128^2\), \(256^2\), \(512^2\)) in Stage II (see Fig. 5). In Stage I, we directly render primary and auxiliary images with the color and density output from the generative radiance field network \(G_s\). With the guidance of geometry constraints, we perform image-level multi-view joint optimization to enhance the geometric reasoning ability of the model. In Stage II, to alleviate the computation-intensive problem of rendering high-resolution images, we instead train the model via feature-level multi-view optimization for better visual quality. First, we adopt a hybrid MLP-CNN architecture to disentangle the geometry of the 3D shape from the fine details of the 2D appearance. Then we generalize volume rendering (Niemeyer & Geiger, 2021) to the feature level by rendering the 2D primary feature map \({\mathcal {F}}_{pri}\) at low resolution (\(64^2\)):

$$\begin{aligned} \begin{aligned} {\mathcal {F}}_{pri} = \sum \limits _{i=1}^{N} T_i(1 - exp(-\sigma _i\delta _i))f_i, \end{aligned} \end{aligned}$$
(7)

where \(f_i \in {\mathbb {R}}^{256}\) is the feature before the final layer of \(G_s\), and other symbols are defined in Eq. 3. The auxiliary feature map \({\mathcal {F}}_{aux}\) is rendered in the same way as \({\mathcal {F}}_{pri}\), and the warped feature map \({\mathcal {F}}_{warp}\) can be obtained through the warping process. Second, we perform multi-view feature-level joint optimization on low-resolution feature maps (\(64^2\)). To enforce the geometry consistency in the feature space, we take the implicit diversified Markov Random Fields (MRF) loss (Wang et al., 2018) as the feature-level re-projection loss:

$$\begin{aligned} {\mathcal {L}}_{fr} = L_{mrf}({\mathcal {F}}_{pri}, {\mathcal {F}}_{warp}), \end{aligned}$$
(8)

which can encourage the model to capture high-frequency geometry details (Feng et al., 2021). Then the stereo mixup mechanism is also applied to the 2D feature maps: \({\mathcal {F}}_{mix} = \eta {\mathcal {F}}_{pri} + (1 - \eta ){\mathcal {F}}_{warp}\). Third, we increase the resolution with a style-based 2D decoder (Karras et al., 2019) \(G_d\), which takes \({\mathcal {F}}_{mix}\) as input and upsamples it to a high-resolution RGB image (see Fig. 7). The 2D decoder \(G_d\) is conditioned by the mapping network \(G_m\) through adaptive instance normalization (AdaIN) (Huang & Belongie, 2017; Dumoulin et al., 2020; Karras et al., 2019). As training progresses, we adopt the progressive growing strategy to grow the generator to higher resolutions (Karras et al., 2018). When new layers are added to \(G_d\), we use skip connections to fade in the inserted layers smoothly, which stabilizes and speeds up the training process (Karras et al., 2018, 2020).
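The feature-level counterpart of the rendering and mixup steps can be sketched as follows; the tensor shapes are assumptions for exposition, and the ID-MRF loss of Eq. (8) would be applied to the rendered feature maps exactly as the image-level loss is applied to images.

```python
import torch

def render_feature_map(features, sigmas, deltas):
    # Eq. (7): volume rendering generalized to per-sample features f_i.
    # features: (num_rays, N, C), sigmas/deltas: (num_rays, N).
    alphas = 1.0 - torch.exp(-sigmas * deltas)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    weights = (trans * alphas).unsqueeze(-1)                  # (num_rays, N, 1)
    return (weights * features).sum(dim=1)                    # (num_rays, C)

def feature_mixup(feat_pri, feat_warp):
    # Stereo mixup applied to the low-resolution feature maps before G_d.
    eta = torch.rand(1, device=feat_pri.device)
    return eta * feat_pri + (1.0 - eta) * feat_warp
```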

Fig. 7

Progressive 2D decoder \(G_d\). During training, the decoder takes the stereo mixup feature \({\mathcal {F}}_{mix}\) (produced from \({\mathcal {F}}_{pri}\) and \({\mathcal {F}}_{warp}\)) as input at low resolution (\(64^2\)). The intermediate latent w conditions the decoder at each layer. The 1x1 convolutions convert the high-dimensional features to RGB images, and bilinear upsampling increases the resolution

Fig. 8

An overview of MVCGAN+. I. Decomposition Phase. We adopt a “top-down” strategy to train the object branch and the background branch. Specifically, we decompose the real images into foreground objects and backgrounds via masks and bounding boxes. Then we impose multi-view constraints to optimize the object generator \(G_{obj}\) and the background generator \(G_{bg}\) individually. Two discriminators, i.e., \(D_{obj}\) and \(D_{bg}\), are employed to perform adversarial training on generated images and real images. II. Composition Phase. We deploy a reverse “bottom-up” manner for rendering. We first generate foreground object images and background images with the object branch and the background branch, respectively. Then the whole image can be composed with object masks and occlusion relations

3.3 MVCGAN+: Towards Multi-Object Generation

While remarkable results have been achieved on 3D-aware image generation, existing methods (Schwarz et al., 2020; Chan et al., 2021; Deng et al., 2022b; Gu et al., 2022; Xu et al., 2021; Pan et al., 2021; Chan et al., 2022) mostly focus on scenes with a single object in the center, and do not work well on multi-object scenes. At present, only GIRAFFE (Niemeyer & Geiger, 2021) considers the compositional properties of scenes and allows for multi-object image generation. However, GIRAFFE (Niemeyer & Geiger, 2021) learns the compositional generative feature fields in an unsupervised manner, which makes it infeasible to decompose the scene into individual objects precisely. Due to the lack of appropriate supervision, GIRAFFE (Niemeyer & Geiger, 2021) can only be verified on simple synthetic data, i.e., CLEVR (Johnson et al., 2017), while more realistic scenes with complex geometry and diverse textures remain unexplored.

To extend to scenarios with multiple objects and backgrounds, we further propose MVCGAN+, a two-branch framework with extra supervision (see Fig. 8 for an overview). We formulate multi-object scene generation as a “decompose and compose” process. During training, MVCGAN+ learns the whole scene in a top-down decomposition manner. Specifically, we incorporate easily accessible 2D annotations, i.e., object bounding boxes and instance masks, into training to disentangle objects and backgrounds. MVCGAN+ contains one object branch with \(G_{obj}\) and \(D_{obj}\), and one background branch with \(G_{bg}\) and \(D_{bg}\). For the object branch, we randomly select a single object from the whole scene and crop the corresponding patch with the background masked out as the real object image (see Fig. 8). We encourage the object generator \(G_{obj}\) to model the foreground object while leaving the background region empty. One remaining problem is that the content of unbounded and occluded regions, e.g., masked backgrounds, can lie at any distance along the ray. Due to the inherent ambiguity of the 2D-to-3D correspondence, the object generator can generate arbitrary geometry outside the target object regions. Consequently, semi-transparent material may float in space and cause cloudy and foggy artifacts when viewed from another angle.

Therefore, we add the residual ray weight (one minus the sum of the color weights along the ray), multiplied by a white background color, to the accumulated color to suppress the low-density areas:

$$\begin{aligned} \hat{{\textbf {c}}} ({\textbf {r}}) = {\textbf {c}} ({\textbf {r}}) + \left( 1 - \sum \limits _{i=1}^{N} T_i(1 - exp(-\sigma _i\delta _i))\right) \cdot c_{white}, \end{aligned}$$
(9)

where \({\textbf {c}} ({\textbf {r}})\) is the accumulated color of ray \({\textbf {r}}\) obtained by volume rendering, \(c_{white}=1\) is the color of the white background (white corresponds to 1 in the normalized color space), \(\sum \limits _{i=1}^{N} T_i(1 - exp(-\sigma _i\delta _i))\) is the sum of the color weights along the ray \({\textbf {r}}\) (see Eq. 3 and Eq. 5 of the original NeRF paper (Mildenhall et al., 2020) for more details), and the remaining symbols are defined in Eq. 3.
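A small sketch of Eq. (9), assuming colors normalized to \([0, 1]\) and a precomputed per-ray weight sum from Eq. (3):

```python
def composite_white_background(color, weight_sum, c_white=1.0):
    # Eq. (9): the residual ray weight (1 - sum_i T_i * (1 - exp(-sigma_i * delta_i)))
    # is assigned to a white background, so empty space renders white instead
    # of accumulating semi-transparent floaters.
    return color + (1.0 - weight_sum) * c_white
```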

For the background branch, we follow NeRF++ (Zhang et al., 2020) and use an additional network \(G_{bg}\) to model the complex backgrounds. As shown in Fig. 8, we remove all the foreground objects and fill the holes with an image inpainting method (Telea, 2004). Since the layout and geometry of the background environment are relatively simple, we can easily inpaint the occluded areas by searching for patches with similar textures in the surrounding regions. In this way, MVCGAN+ models the objects and backgrounds individually by leveraging the information of object bounding boxes and instance masks. The disentanglement of the objects and backgrounds allows us to impose the multi-view geometry constraints on the object branch and the background branch separately.

Table 1 Quantitative comparison

In the composition phase, to compose the generated objects and backgrounds into a coherent scene, we first perform object arrangement and then reason about the geometric relations between the foreground objects and the background. For the object placement, we follow GIRAFFE (Niemeyer & Geiger, 2021) and transform coordinates from the object-centric space to the scene space with the rotation matrix \(R_{obj}\) and the translation vector \(t_{obj}\):

$$\begin{aligned} \begin{aligned}&k ({\textbf {x}}) = R_{obj} {\textbf {x}} + t_{obj}, \\ \end{aligned} \end{aligned}$$
(10)

where \(k ({\textbf {x}})\) is the transformed coordinate and \(t_{obj}\) is the object location in the scene space. We generate the holistic image by performing alpha composition:

$$\begin{aligned} \begin{aligned}&{\mathcal {I}}_{final} = {\mathcal {I}}_{fg} \cdot {\mathcal {M}} + (1 - {\mathcal {M}}) \cdot {\mathcal {I}}_{bg}, \\ \end{aligned} \end{aligned}$$
(11)

where \({\mathcal {I}}_{fg}\) is the rendered foreground object image, and \({\mathcal {I}}_{bg}\) is the rendered background image. The foreground object mask \({\mathcal {M}} = \sum \limits _{i=1}^{N} T_i (1 - exp(-\sigma _i\delta _i))\) is generated by \(G_{obj}\) according to the accumulated density. For the overlapping areas between objects, we reason about the occlusion relations by combining the 3D spatial locations of the objects with their depth values. Specifically, for every pixel in the rendered image, the object closest to the camera occludes the other objects as well as the background.
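The composition phase of Eqs. (10) and (11) can be sketched as follows; the tensor layout is our own assumption, and the multi-object case follows the depth-ordering rule described above.

```python
import torch

def object_to_scene(x, R_obj, t_obj):
    # Eq. (10): rigid transform from object-centric to scene coordinates.
    # x: (num_points, 3), R_obj: (3, 3), t_obj: (3,).
    return x @ R_obj.T + t_obj

def alpha_compose(rgb_fg, mask, rgb_bg):
    # Eq. (11): composite the rendered foreground over the background with the
    # accumulated-density mask M produced by G_obj. With several objects, the
    # one with the smallest depth at a pixel provides rgb_fg and mask there.
    return rgb_fg * mask + (1.0 - mask) * rgb_bg
```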

4 Experiments

4.1 Datasets

We conduct experiments on both single-object and multi-object datasets.

Single-object Datasets. For the single-object setting, we report results on five high-resolution image datasets: CELEBA-HQ (Karras et al., 2018), FFHQ (Karras et al., 2019), AFHQv2 (Choi et al., 2020), M-Plants (Skorokhodov et al., 2022), and M-Food (Skorokhodov et al., 2022). CELEBA-HQ (Karras et al., 2018) consists of 30,000 high-quality images of human faces. Flickr-Faces-HQ (FFHQ) (Karras et al., 2019) is a widely used human face dataset that contains 70,000 high-quality images. Animal Faces-HQ (AFHQv2) (Choi et al., 2020) contains 15,000 high-quality animal face images; we use its cat face images in our experiments. The Megascans Plants (M-Plants) (Skorokhodov et al., 2022) dataset consists of 141,824 plant images, while Megascans Food (M-Food) (Skorokhodov et al., 2022) contains 25,472 food images.

Multi-object Datasets.

For multi-object scenes, existing datasets, e.g., CLEVR (Johnson et al., 2017), multi-dSprites (Matthey et al., 2017), Objects Room (Burgess & Kim, 2018), Tetrominoes (Kabra et al., 2019), and CATER (Girdhar & Ramanan, 2019), typically contain objects with very simple geometric shapes and plain backgrounds. Taking the representative dataset CLEVR (Johnson et al., 2017) as an example, each scene contains three kinds of objects, i.e., cubes, spheres, and cylinders, all of which are geometric primitives with standard and symmetrical geometries. In this paper, we conduct experiments on the more complex and realistic Room-Chair dataset (Yu et al., 2022b), which contains indoor scenes with chairs, walls, and floors. Specifically, we adopt the script of Yu et al. (2022b) to render 32,000 images at a resolution of \(256^2\). To render chairs with diverse shapes, we choose 649 chair models from the ShapeNet (Chang et al., 2015) library. For the backgrounds, we use 50 types of floors with different textures and materials, e.g., wooden floors. Each image contains a random number of chairs, with a maximum of four. Besides, we also render the instance masks and obtain the object bounding boxes as annotations. It is worth noting that the geometry of a chair is much more complex than that of other ShapeNet (Chang et al., 2015) objects such as cars and bowls, because chairs have many thin and fine structures such as backrests and legs. To our knowledge, Room-Chair is the most challenging multi-object dataset we could find.

4.2 Training Details

We use a progressive growing convolutional discriminator \(D_{\phi }\) to compare the fake image produced by the generator \(G_{\theta }\) and the real image \({\mathcal {I}}\) sampled from the training data distribution \(p_{\mathcal {D}}\). For single-object generation, we train MVCGAN using a non-saturating GAN objective with the \(R_1\) gradient penalty (Mescheder et al., 2018) and the proposed geometry-constrained objective \({\mathcal {L}}_{re}\) as the total loss:

$$\begin{aligned} \begin{aligned} {\mathcal {V}}(\theta , \phi )&= {\textbf {E}}_{z\sim {\mathcal {Z}}, \xi _{pri}\sim p_{\xi }, \xi _{aux}\sim p_{\xi }}[f(D_{\phi }(G_{\theta }(z, \xi _{pri}, \xi _{aux})))] \\&\quad + {\textbf {E}}_{{\mathcal {I}}\sim p_{{\mathcal {D}}}}[f(-D_{\phi }({\mathcal {I}})) -\lambda || \nabla D_{\phi }({\mathcal {I}})||^2] + {\mathcal {L}}_{re}, \end{aligned} \end{aligned}$$
(12)

where \(f(t)=-log(1+exp(-t))\), \({\mathcal {L}}_{re} = {\mathcal {L}}_{ir}\) for Stage I (see Eq. 5), \({\mathcal {L}}_{re} = {\mathcal {L}}_{fr}\) for Stage II (see Eq. 8), and \(\lambda =10\). We employ the Adam optimizer (Kingma & Ba, 2015) with \(\beta _1 = 0\), \(\beta _2 = 0.9\), and a batch size of 56. The initial learning rates are set to \(6.0\times 10^{-5}\) for the generator and \(2.0\times 10^{-4}\) for the discriminator, and decay over training to \(1.5\times 10^{-5}\) and \(5.0\times 10^{-5}\), respectively.
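For reference, a standard implementation of the non-saturating objective with the \(R_1\) penalty in Eq. (12) looks as follows; the geometry term \({\mathcal {L}}_{re}\) is simply added to the generator loss, and the G and D interfaces are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, real_img, fake_img, lam=10.0):
    # Non-saturating GAN loss with an R1 gradient penalty on real images
    # (Mescheder et al., 2018).
    real_img = real_img.detach().requires_grad_(True)
    d_real, d_fake = D(real_img), D(fake_img.detach())
    loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    grad, = torch.autograd.grad(d_real.sum(), real_img, create_graph=True)
    r1 = grad.flatten(1).pow(2).sum(dim=1).mean()
    return loss + lam * r1

def generator_loss(D, fake_img, geometry_loss):
    # Fool the discriminator and satisfy the re-projection constraint L_re.
    return F.softplus(-D(fake_img)).mean() + geometry_loss
```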

For multi-object generation, we train the object generator \(G_{obj}\) and the background generator \(G_{bg}\) using the same Adam optimizer, learning rate, and batch size as in the single-object setting. The main difference between the multi-object and the single-object settings is that we sample camera poses from different distributions because the training datasets depict different scenes (please refer to Sect. 1 of the Appendix for the specific camera pose distribution of each dataset).

Fig. 9

The face identity preservation score (Face-IDS) of images

Fig. 10

Qualitative comparison at \(512^2\) resolution on single-object datasets

Fig. 11

Qualitative comparison at \(256^2\) resolution under the multi-object setting on Room-Chair (Yu et al., 2022b). We render the scenes from different camera viewpoints

Fig. 12

Ablation study on FFHQ (Karras et al., 2019) at \(256^2\) resolution

4.3 Comparison with SOTA

For quantitative comparison, we report the Frechet Inception Distance (FID) (Heusel et al., 2017) to evaluate image quality. We compare our approach against six state-of-the-art 3D-aware image synthesis methods: GRAF (Schwarz et al., 2020), pi-GAN (Chan et al., 2021), GOF (Xu et al., 2021), ShadeGAN (Pan et al., 2021), GIRAFFE (Niemeyer & Geiger, 2021), and GRAM (Deng et al., 2022b). As shown in Table 1, our method consistently outperforms other methods (Schwarz et al., 2020; Niemeyer & Geiger, 2021; Chan et al., 2021; Xu et al., 2021; Pan et al., 2021) on both single-object and multi-object datasets (Karras et al., 2018, 2019; Choi et al., 2020; Skorokhodov et al., 2022) by a large margin. In particular, on the Room-Chair dataset (Yu et al., 2022b), we observe that most methods cannot handle multi-object scenarios and fail to learn an appropriate generative model for the scene. In contrast, the extension MVCGAN+ can effectively render compositional scenes with disentangled objects and backgrounds, outperforming GIRAFFE by a clear margin. To further demonstrate the effectiveness of the proposed method, we also visualize the generated images on single-object and multi-object datasets for qualitative comparison. As illustrated in Figs. 10 and 11, we render images from a wide range of viewpoints. On single-object datasets, we observe that GRAF (Schwarz et al., 2020), GIRAFFE (Niemeyer & Geiger, 2021), and pi-GAN (Chan et al., 2021) either fail to synthesize reasonable results under large view variations or show obvious multi-view inconsistency artifacts. For multi-object scenarios, we note that GIRAFFE (Niemeyer & Geiger, 2021) suffers from collapsed results when the viewpoint changes. By comparison, our method achieves the best performance in both visual quality and multi-view consistency. Please refer to the appendix and the supplementary material for more visualization results.

4.4 Ablation Studies

Image-level and Feature-level Optimization. We conduct ablation studies to help understand the individual contributions of image-level and feature-level multi-view joint optimization. From Fig. 12a, we observe that the generated images maintain the multi-view consistency under pose variations (FID = 22.5), indicating that image-level optimization can guide the model to learn a reasonable 3D shape. With feature-level optimization (see Fig. 12b), our approach can further improve the visual quality of generated images  (FID = 13.7). As shown in Fig. 12, we note that the images generated by feature-level optimization have more delicate details, such as clear wrinkles, the highlight on the forehead, and the shadow of the cheeks.

Fig. 13

Without the decomposition phase, the generated images have poor object quality and the model cannot disentangle objects and backgrounds

Multi-view Consistency. On the human face dataset (Karras et al., 2019), we take inspiration from Lin et al. (2022) and adopt the face identity preservation score (Face-IDS) to evaluate the multi-view consistency of generated images. For portrait image animation and attribute-editing tasks (Lin et al., 2022; Wu et al., 2022; Deng et al., 2020), the face identity preservation score (Face-IDS) reflects how well the identity of the target image is preserved compared to the source image. Here we use Face-IDS to evaluate multi-view consistency by measuring the similarity between different views. We first generate 1000 faces and render each face from two random camera poses. Then, for each image pair of the same generated face, we calculate the cosine similarity of the predicted embeddings with a pretrained ArcFace network (Deng et al., 2019). The ArcFace similarity score takes values between \(-1\) and 1 (a greater value means more similar; see Fig. 9 for examples). Finally, we compute the mean score over the 1000 faces as the face identity preservation score (Face-IDS).
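A sketch of the Face-IDS computation is given below; generate_face, sample_pose, and arcface_embed are assumed interfaces standing in for our generator, the camera pose sampler, and a pretrained ArcFace embedder, respectively.

```python
import torch
import torch.nn.functional as F

def face_ids(generate_face, sample_pose, arcface_embed, num_faces=1000):
    # Face identity preservation score: mean ArcFace cosine similarity
    # between two random views of the same generated face.
    scores = []
    for _ in range(num_faces):
        z = torch.randn(1, 256)                       # one identity
        img_a = generate_face(z, sample_pose())       # view 1
        img_b = generate_face(z, sample_pose())       # view 2
        sim = F.cosine_similarity(arcface_embed(img_a), arcface_embed(img_b))
        scores.append(sim.item())
    return sum(scores) / len(scores)
```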

Fig. 14

Scene Decomposition. The generated images can be decomposed into individual objects and backgrounds

Fig. 15

Visualization of extracted 3D meshes with single-view 3D reconstruction (Lorensen & Cline, 1987)

As shown in Table 2, our method achieves the best face identity preservation score (multi-view consistency). We further conduct experiments to study whether increasing the number of auxiliary poses can improve multi-view consistency. From Table 2, we observe that using more auxiliary poses leads to degraded performance: the face identity preservation score (Face-IDS) decreases to 0.58 and 0.51 for 2 and 3 auxiliary poses, respectively. We suspect the performance drop has two causes. First, since both the primary and auxiliary poses are randomly sampled from the camera pose distribution, sampling more poses does not bring a performance gain. Second, increasing the number of auxiliary poses consumes much more GPU memory, because the model has to perform volume rendering multiple times per iteration. Consequently, we need to reduce the batch size to 8 and adjust the weight of the \(R_1\) gradient penalty (Mescheder et al., 2018) (\(\lambda \) in Eq. 12). However, the decreased batch size affects the training stability and makes the model hard to converge, while the increased \(R_1\) regularization weight can hurt the overall performance (Karras et al., 2021; Mescheder et al., 2018).

Table 2 Quantitative evaluation of multi-view consistency
Fig. 16

Visualization of the COLMAP reconstruction (Schonberger & Frahm, 2016) from synthesized multi-view images

Fig. 17

Style interpolation. We perform linear interpolation simultaneously in both the intermediate latent and camera pose space. We can observe that the transition results are smooth and consistent

Markov Random Fields Loss.

Previous work (Feng et al., 2021) found that the ID-MRF loss captures high-frequency details better than the L1 loss in 3D face reconstruction (Feng et al., 2021) and image reconstruction (Wang et al., 2018). Therefore, we adopt the Implicit Diversified Markov Random Field (ID-MRF) loss (Wang et al., 2018) to enforce the geometry consistency between views. We also conduct experiments to compare the effect of the ID-MRF and L1 losses on multi-view consistency. Since there is no ground truth for the generated image, we adopt the face identity preservation score (Face-IDS) as the quantitative metric of multi-view consistency.

When using the vanilla L1 loss, we observe that the model still achieves similar multi-view consistency (face identity preservation score = 0.61) to the ID-MRF loss (face identity preservation score = 0.62). It seems that the ID-MRF loss has no obvious advantage over the L1 loss. We suspect that the problem lies in the quantitative metric of multi-view consistency: the face identity preservation score may not capture high-frequency details such as the wrinkles visualized in Fig. 9 of Feng et al. (2021). As mentioned in the last paragraph (Multi-view Consistency), we compute the face identity preservation score (Face-IDS) with the ArcFace cosine similarity (Deng et al., 2019). However, the extracted embedding may lose high-frequency and fine-grained details due to the pooling operations of the ArcFace network (Deng et al., 2019). Therefore, the simple L1 loss obtains a face identity preservation score similar to that of the ID-MRF loss.

Decompose and Compose.

The “decompose and compose” paradigm is essential in compositional image generation. If we directly generate multiple objects using the single-object method and then compose them into a whole scene without the decomposition phase, the generated image has poor object quality and the objects and backgrounds cannot be disentangled (see Fig. 13). This problem mainly comes from the discriminator, which plays a critical role in the training process of GANs. Without the decomposition phase, we need to perform adversarial training with a scene-level discriminator between the rendered scenes and real images. In this case, the model pays more attention to the global coherence of the whole scene and neglects the supervision of individual objects. For a single object in the scene, the scene-level discriminator provides only weak learning signals for the object radiance field, because the object region occupies only a small proportion of the whole image. The inadequate training of individual objects leads to degraded object quality. More importantly, the scene-level discriminator cannot disentangle objects and backgrounds, so the background generator easily overfits to the whole scene. In contrast, we deploy the decomposition phase to train the object branch and the background branch individually. On the one hand, using two discriminators (the object discriminator \(D_{obj}\) and the background discriminator \(D_{bg}\)) provides sufficient supervision for objects, leading to better quality of the generated objects. On the other hand, the disentanglement of objects and backgrounds allows us to control them separately, such as moving and rotating each object or the background.

Scene Decomposition. We also investigate the disentanglement of foregrounds and backgrounds of MVCGAN+. As shown in Fig. 14, our method can decompose foreground objects and backgrounds from the holistic scene. The disentanglement allows us to control each object and the background individually. We can perform scene editing such as adding, moving, deleting, rotating, and changing individual objects or backgrounds. Please refer to the supplementary video for more visualization results.

Fig. 18

Style mixing. The source A and B images are generated from the input latent codes \(z_A\) and \(z_B\). The images in one highlighted box are generated by applying \(w_B\) (the intermediate latent corresponding to \(z_B\)) to \(G_s\) and \(w_A\) (corresponding to \(z_A\)) to \(G_d\); the images in the other box are generated by applying \(w_A\) to \(G_s\) and \(w_B\) to \(G_d\) (Color figure online)

3D Representation. To better illustrate the learned 3D representation, we visualize the underlying 3D shape from generated images with 3D reconstruction methods (Schonberger & Frahm, 2016; Lorensen & Cline, 1987). For the single-view 3D reconstruction, we adopt the marching cubes algorithm (Lorensen & Cline, 1987) to extract the underlying geometry of the generated image (see Fig. 15 for the visualized 3D meshes). To further demonstrate the multi-view consistency of our method, we also perform multi-view 3D reconstruction (Schonberger & Frahm, 2016) to recover the 3D shape from generated multi-view images. Specifically, we first render images of a single instance from 35 views, and then perform dense 3D reconstruction by running COLMAP (Schonberger & Frahm, 2016) with default parameters and no known camera poses. The results in Figs. 15 and 16 validate the correctness of the 3D shape learned by our model.

Style Interpolation. We also conduct style interpolation experiments to investigate the intermediate latent w learned by the mapping network \(G_m\). Given two generated images, we perform linear interpolation both in the intermediate latent space \({\mathcal {W}}\) and the camera pose space. As illustrated in Fig. 17, the smooth transition of both pose and appearance demonstrates that our model learns semantically meaningful intermediate latent space \({\mathcal {W}}\).

Shape-detail Disentanglement. Besides, we design a style mixing experiment to study what kinds of representations the generative radiance field \(G_s\) and the progressive 2D decoder \(G_d\) learn, respectively. Specifically, we input two latent codes \(z_A\) and \(z_B\) into the mapping network \(G_m\), and obtain the corresponding intermediate latents \(w_A\) and \(w_B\) in the \({\mathcal {W}}\) space. Then we generate style-mixing images by applying \(w_A\) and \(w_B\) to control the different parts of the generator (\(G_s\) and \(G_d\)). As shown in Fig. 18, we observe that controlling \(G_s\) changes the 3D shape (identity and pose) while controlling \(G_d\) changes the 2D appearance details (colors of skin, hair, and beard). The results verify that the hybrid MLP-CNN architecture disentangles the geometry of the 3D shape from the fine details of the 2D appearance.
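The style-mixing procedure can be sketched as below, with G_m, G_s, and G_d treated as assumed callable interfaces: the latent that conditions \(G_s\) determines the 3D shape, while the latent that conditions \(G_d\) determines the 2D appearance details.

```python
def style_mixing(G_m, G_s, G_d, z_a, z_b, pose):
    # w_A drives the generative radiance field (shape, identity, pose);
    # w_B drives the progressive 2D decoder (appearance details).
    w_a, w_b = G_m(z_a), G_m(z_b)
    feat = G_s(w_a, pose)        # low-resolution feature map, shape from w_A
    return G_d(feat, w_b)        # high-resolution RGB, details from w_B
```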

5 Conclusion

We present a multi-view consistent generative model (MVCGAN) for compositional 3D-aware image synthesis. The key idea underpinning the proposed method is to enhance the geometric reasoning ability of the generative model by introducing geometry constraints. Besides, we adapt MVCGAN to more complex, multi-object scenes. Extensive experiments on single-object and multi-object datasets demonstrate that the proposed method achieves state-of-the-art performance for 3D-aware image synthesis.