1 Introduction

Manipulating 3D objects derived from a single photo poses a considerable challenge. Unlike previous works that directly edit 3D man-made objects [1], this task comprises numerous complex subtasks: extracting the target shape from the background; estimating the 3D pose, shape, and materials of the object; estimating the lighting conditions in the scene from image observations; and defining the controllers used to manipulate the estimated object properties. Recently, advanced deep learning-based methods have been developed for many of these subtasks, including object segmentation [2]; object pose [3], shape [4,5,6], and material estimation [7]; and lighting estimation [8, 9].

Fig. 1

Overview of StylePart. a We first project the input image into the GAN latent space, and b map the projected GAN latent code \(\textbf{w}_\text {input}\) to its corresponding shape attributes and viewing angle using a forward shape-consistent latent mapping function. c A user can directly manipulate the image shape at the part level. Then, we obtain the manipulated GAN latent code \(\textbf{w}_\text {manipulate}\) by d mapping the manipulated attributes to the GAN latent space with a backward mapping function. Finally, we e synthesize the final edited image without the need for any 3D workflow

Moreover, generative adversarial networks (GANs) have opened up a new high-fidelity image generation paradigm. For example, owing to its disentangled style space, StyleGAN [10] can produce high-resolution facial images with unmatched photorealism and supports stylized manipulation. Users can manipulate the generated outputs by adjusting the latent code [11,12,13,14,15]. They can also edit a natural image by projecting it into the GAN latent space, identifying a latent code that reconstructs the input image, and then modifying that code [16, 17].

Semantic 3D controllers have been proposed to allow semantic parameter control over images. For example, the 3D morphable face model (3DMM) has been used to manipulate face shape, facial expression, head orientation, and scene illumination in both generated [18] and natural [19, 20] human face images. However, these methods are applicable only to portrait face images, which limits their usage.

In this paper, we present StylePart, which investigates how to achieve part-based manipulation of man-made objects in a single image. The key insight of our method is to augment the image latent space with a controllable, semantically meaningful 3D generative shape representation. We propose a shape-consistent mapping function that connects the image generative latent space and the 3D man-made shape attribute latent space. The shape-consistent mapping function is composed of forward and backward mapping functions. We “forwardly map” the input image from the image latent code space to the shape attribute space, where the shape can be easily manipulated. The manipulated shape is then “backwardly mapped” to the image latent code space and synthesized using a pretrained StyleGAN (Fig. 1). With the backward mapping, the user obtains the final edited image without resorting to any 3D workflow; this is a major advantage because users need not configure lighting conditions, material properties, or other rendering parameters. We also propose a novel training strategy that guarantees the soundness of the mapped 3D shape structure. Overall, our method offers a significant advantage over other methods because it allows explicit part-based control through the PartVAE code. For example, although pixelNeRF [21] can render novel views of a shape from a sparse set of views, it does not support intuitive direct manipulation of the shape. Moreover, compared to a baseline conditional VAE, our method achieves better shape reconstruction. Finally, our method includes both forward and backward mapping functions, making it versatile for a wide range of image-based shape editing applications, as shown in this paper.

Note that we focus on the subtask of extracting controllers to manipulate the shape only, rather than proposing a full image manipulation system that tackles all subtasks at once. We therefore use images with a simple rendering style, without complex lighting conditions or material properties, to demonstrate our results. We evaluate our method through shape reconstruction tests on four man-made object categories (chair, cup, car, and guitar). We also present several identity-preserving manipulation results for three shape part manipulation tasks: part replacement, part resizing, and orientation manipulation.

2 Related work

2.1 Image-based shape reconstruction and editing

Three-dimensional modeling based on a single photo has long been a challenge in computer graphics and computer vision. Several studies have investigated shape inference from multiple images [22, 23] and from single photos for different shape representations, such as voxels [24, 25], point clouds [26], meshes [5, 6, 27], and simple primitives [28]. With advances in deep learning, the quality of reconstructed shapes and images has improved tremendously; however, the results are usually not semantically controllable. Chen et al. [28] proposed 3-Sweep, an interactive method for shape extraction and manipulation in a photo: a user creates 3D primitives (cylinders and cuboids) with 3-Sweep and manipulates the photo content using the extracted primitives. Banerjee et al. [29] enable users to manually align a publicly available 3D model to guide geometry completion and light estimation. Zheng et al. [30] proposed a system that automatically extracts cuboid-based proxies and enables semantic manipulation. Unlike previous methods, our method leverages recent advances in part-based generative shape representations [31, 32] to automatically infer shape attributes. Moreover, the shape parts we handle are more complex than the simple cylinders and cuboids used in previous methods [28, 30].

2.2 GAN inversion and latent space manipulation

GAN inversion is required to edit a real image through latent manipulation. GAN inversion identifies the latent vector from which the generator can best replicate the input image. Inversion methods can typically be divided into optimization- and encoder-based methods. Optimization-based methods directly optimize the latent code for a single sample [33,34,35,36], whereas encoder-based methods train an encoder over a large number of samples [37,38,39]. Some recent works have augmented the inversion process with additional semantic constraints [40] and additional latent codes [41]. Among these works, many have specifically considered StyleGAN inversion and investigated latent spaces with different disentanglement abilities, such as \(\mathcal {W}\) [42], \(\mathcal {W}^{+}\) [33, 34, 43], and style space (\(\mathcal {S}\)) [44]. Several works have examined semantic directions in the latent spaces of pretrained GANs: fully supervised approaches using semantic labels [11, 45] and self-supervised approaches [46,47,48] have been employed, and several recent studies have utilized unsupervised methods to obtain semantic directions [12, 13, 49]. A series of works focused on real human facial editing [18, 19] and non-photorealistic faces [50]; they utilized a prior in the form of a 3D morphable face model.

Unlike EditGAN [43], StylePart enables more 3D-aware, part-based image editing such as part replacement and shape orientation manipulation. Compared to Liu et al. [51], our method does not require a part-level CAD model to be prepared beforehand and supports more object categories.

3 Method

Fig. 2

The network architecture of our cross-domain mapping framework

We aim to enable users to directly edit the structure of a man-made shape in an image. We first describe the background of the 3D shape representation method, which allows for tight semantic control of a 3D man-made shape. We then describe a new neural network architecture that maps latent vectors between the image and 3D shape domains, and highlight the different shape part manipulation methods of the architecture.

3.1 3D shape attribute representation

We adopt a structured deformable mesh generative network (SDM-NET) [31] to represent the man-made shape attributes. The network allows for tight semantic control of different shape parts. Moreover, it generates a spatial arrangement of closed, deformable mesh parts, which represents the global part structure of a shape collection, e.g., chair, table, and airplane. In SDM-NET, a complete man-made shape is generated using a two-level architecture comprising a part variational autoencoder (PartVAE) and a structured parts VAE (SP-VAE). PartVAE learns the deformation of a single part using its Laplacian feature vector, extracted through the method established in [52], and SP-VAE jointly learns the deformation of all parts of a shape and the structural relationship between them. In this work, given an input shape s of category c with \(n_{c}\) parts, we represent its shape attributes \(\varvec{S}=(\varvec{P}, \varvec{T})\) using both of the following (a minimal data-layout sketch is given after the list):

  • Geometric attribute (\(\varvec{P}\in \mathbb {R}^{z\times n_{c}}\)): all the PartVAE latent codes of each shape, where \(z=128\) is the dimension of the PartVAE latent code adopted in our work.

  • Topology attribute (\(\varvec{T}\in \mathbb {R}^{2n_{c}+73}\)): the representation vector \(\textbf{rv}\) in SDM-Net. This vector represents the associated relationships of each part in s, including existence, supporting, and symmetry information. Different shapes in the same category will have different \(\textbf{rv}\). Please refer to Sect. 3.1 in Gao et al. [31].
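To make this layout concrete, the following is a minimal sketch of how a shape attribute could be stored in practice; the class, field names, and the `flat` helper are our own illustration and are not part of SDM-NET or our implementation.

```python
from dataclasses import dataclass

import numpy as np

Z_PART = 128  # PartVAE latent code dimension z


@dataclass
class ShapeAttribute:
    """Shape attribute S = (P, T) for one shape of a category with n_c parts (illustrative)."""
    P: np.ndarray  # geometric attribute, shape (Z_PART, n_c): one PartVAE code per part
    T: np.ndarray  # topology attribute, shape (2 * n_c + 73,): SDM-NET representation vector rv

    def flat(self) -> np.ndarray:
        # Concatenate the per-part PartVAE codes (columns of P) and the topology vector.
        return np.concatenate([self.P.flatten(order="F"), self.T])
```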

3.2 Shape-consistent mapping framework

At the core of our image-based shape part manipulation pipeline (shown in Fig. 2) is a cross-domain mapping structure between the image latent space \(\mathcal {W}\) and the shape attribute space \(\mathcal {S}\). Given an input image containing a man-made object (I), the image inversion method is first applied to optimize a GAN latent code (\(w_{I}\)) that can best reconstruct the image. Then, the corresponding shape attributes (\(\varvec{S}\)) are obtained using a forward shape-consistent latent mapping function (\(M_{F}\)). The viewing angle of the input image is predicted using a pretrained viewing angle predictor (\(M_{V}\)). From the mapped geometric attribute, topology attribute, and the predicted viewing angle, we can perform image-based shape editing tasks, such as part replacement, part deformation, and viewing angle manipulation by manipulating the attribute codes. After manipulating the attribute codes, we use the backward mapping function (\(M_{B}\)) to map the shape attributes to an image latent code (\({w}'\)). The final edited image can be synthesized as \(I_{{w}'} = G({w}')\).
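The editing flow just described can be summarized in a few lines of pseudocode; every argument names a component defined in this section, and the function itself is an illustrative sketch rather than our actual implementation.

```python
def edit_image(I, G, M_F, M_V, M_B, invert, manipulate):
    """Sketch of the editing flow in Fig. 2 (illustrative; component names follow the text)."""
    w_I = invert(I, G)                  # image inversion (Sect. 3.2.1, Eq. 1)
    P, T = M_F(w_I)                     # forward mapping: W -> S
    v = M_V(w_I)                        # viewing angle prediction
    P_new, T_new = manipulate(P, T)     # part replacement / resizing, etc. (Sect. 4)
    w_new = M_B((P_new, T_new), v)      # backward mapping: S (+ view) -> W
    return G(w_new)                     # synthesize the edited image I' = G(w')
```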

3.2.1 Image inversion

To obtain the latent code of the input image I, we optimize the following objective:

$$\begin{aligned} w_{I} = \mathop {\mathrm {arg\,min}}\limits _{w} \mathcal {L}_{LPIPS}(I, G(w;\theta ))+\lambda _{w}\mathcal {L}_{\textbf{w}_\text {reg}}(w), \end{aligned}$$
(1)

where G is a pretrained StyleGAN-ADA [53] generator with weights \(\theta \), \(\mathcal {L}_{LPIPS}\) denotes the perceptual loss [54], and \(\mathcal {L}_{\textbf{w}_\text {reg}} = \Vert w\Vert ^2_2\) denotes the latent code regularization loss. We introduce \(\mathcal {L}_{\textbf{w}_\text {reg}}\) because we observed that multiple latent codes can synthesize the same image; the regularization term makes the recovered latent code for each input image well defined.
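As a concrete reference, a minimal optimization-based inversion loop for Eq. (1) could look as follows. We assume the `lpips` package for the perceptual loss and treat G as a callable that maps a 512-dimensional latent code directly to an image; the step count, learning rate, and \(\lambda _{w}\) value are illustrative rather than our reported settings.

```python
import lpips  # LPIPS perceptual loss package (assumed available)
import torch


def invert(I, G, steps=1000, lr=0.01, lambda_w=1e-3, device="cuda"):
    """Optimization-based GAN inversion sketch for Eq. (1); hyperparameters are illustrative."""
    percep = lpips.LPIPS(net="vgg").to(device)
    w = torch.zeros(1, 512, device=device, requires_grad=True)  # latent code in W
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        I_hat = G(w)                                                  # reconstruction G(w; theta)
        loss = percep(I_hat, I).mean() + lambda_w * (w ** 2).sum()    # LPIPS + ||w||_2^2
        loss.backward()
        opt.step()
    return w.detach()
```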

3.2.2 From \(\mathcal {W}\) to \(\mathcal {S}\)

Given a GAN latent code \({w_I}\), which is inverted from an input image I, we obtain the corresponding shape attribute code \(\varvec{S}=(\varvec{P}, \varvec{T})\) using the forward mapping network \((\varvec{P}, \varvec{T}) = M_F(\varvec{w_I})\), where the 3D shape generated by \((\varvec{P}, \varvec{T})\) best fits the target man-made shape encoded in the image latent code \(\varvec{w_I}\).

3.2.3 From \(\mathcal {S}\) to \(\mathcal {W}\)

Given the shape attribute \(\varvec{S}\) and one-hot vector of a viewing angle \(\varvec{v}\), the backward mapping function predicts a GAN latent code \(\textbf{w}' = M_{B}(\varvec{S}, \varvec{v})\), where the synthesized image \(I_{{w}'} = G({w}')\) best matches the image containing the target shape described in \(\varvec{S}\) with viewing angle \(\varvec{v}\). Our mapping functions (\(M_{F}\) and \(M_{B}\)) are realized using two eight-layer MLP networks.
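A minimal sketch of one such eight-layer MLP is given below; the hidden width, activation, and the chair-category dimensions in the usage example are assumptions for illustration.

```python
import torch.nn as nn


class MappingMLP(nn.Module):
    """Eight-layer MLP used to realize M_F or M_B (width and activation are assumed)."""

    def __init__(self, in_dim, out_dim, hidden=512):
        super().__init__()
        dims = [in_dim] + [hidden] * 7 + [out_dim]   # eight linear layers in total
        layers = []
        for i in range(8):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < 7:
                layers.append(nn.LeakyReLU(0.2))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


# Example for the chair category (n_c = 3 parts): M_F maps w (512-d) to S = (P, T),
# i.e. 3 * 128 + (2 * 3 + 73) dimensions; M_B maps S plus a 12-d view one-hot back to 512.
M_F = MappingMLP(512, 3 * 128 + 2 * 3 + 73)
M_B = MappingMLP(3 * 128 + 2 * 3 + 73 + 12, 512)
```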

Fig. 3

a With Laplacian feature differences, some deformations are not captured (highlighted by red rectangles). b In contrast, vertex coordinate differences capture these deformations more accurately

3.3 Training strategy and loss function

3.3.1 Data preparation

To train our shape-consistent mapping function, both the image-based latent code and the shape attribute code for each man-made object shape are required. In this paper, we use synthetic datasets of four categories (chair, guitar, car, and cup). A dataset contains \(N^M\) shapes, formed by interchanging the M parts of N shapes. For each shape, we apply a simple procedure to ensure there is no gap in the interchanged shape: we fix one part of the shape and move the remaining parts toward the fixed part until their bounding boxes intersect (this step is sketched below). We render these shapes from different viewing angles to obtain paired shapes and images, sampling the viewing angles at \(30^{\circ }\) intervals on the yaw axis. For the chair and guitar categories, \(N=15\) and \(M=3\); for the cup and car categories, \(N=50\) and \(M=2\). There are 3375 shapes and 40,500 images for each of the chair and guitar categories, and 2500 shapes and 30,000 images for each of the car and cup categories. We split the synthetic dataset of each category into \(80\%\) training data and \(20\%\) testing data.

For shape k, we first prepare the shape attribute code \(\varvec{S}_k\). Each part is represented by a feature matrix \(f\in \mathbb {R}^{9\times V}\) that describes the per-vertex deformation of a template cube with \(V=3752\) vertices. Let \(n_c\) denote the number of parts for category c, and let \(z_\textrm{part}=128\) be the dimension of a PartVAE latent code. We embed the feature matrix of part i (\(f_i\)) into a pretrained PartVAE and obtain \(\varvec{P}_i = Enc_{P_i}(f_i)\). We also compute the topology vector \(\varvec{T} \in \mathbb {R}^{2n_{c}+73}\) for each shape. Finally, for each shape, we concatenate the PartVAE codes of all parts and the topology vector into the final shape attribute vector \(\varvec{S}=(\varvec{P}_0, \varvec{P}_1, \ldots , \varvec{P}_{n_{c}-1}, \varvec{T})\).
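The gap-closing step can be sketched as follows; the step size, iteration bound, and axis-aligned bounding-box test are our assumptions, since the text only specifies moving parts toward the fixed part until their bounding boxes intersect.

```python
import numpy as np


def close_gaps(parts, fixed_idx=0, step=0.01, max_iters=1000):
    """Gap-closing sketch for interchanged parts: keep one part fixed and translate every
    other part toward it until their bounding boxes intersect.
    `parts` is a list of (V_i, 3) vertex arrays; step size and loop bound are assumptions."""

    def intersects(a, b):
        return bool(np.all(a.max(axis=0) >= b.min(axis=0)) and
                    np.all(b.max(axis=0) >= a.min(axis=0)))

    fixed = parts[fixed_idx]
    center_fixed = fixed.mean(axis=0)
    out = [p.copy() for p in parts]
    for i, p in enumerate(out):
        if i == fixed_idx:
            continue
        direction = center_fixed - p.mean(axis=0)
        norm = np.linalg.norm(direction)
        if norm == 0:
            continue
        direction /= norm
        for _ in range(max_iters):
            if intersects(p, fixed):
                break
            p += step * direction   # move the whole part toward the fixed part
    return out
```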

Next, we prepare the image-based latent code for shape k. We render all shapes from 12 viewing angles and use these images to train a StyleGAN2 with adaptive discriminator augmentation [53], conditioned on the image’s viewing angle. We project the rendered image containing shape k into the pretrained conditional StyleGAN latent space to obtain the corresponding latent code \(\varvec{w}_k \in \mathbb {R}^{512}\). We collect all \((\varvec{w}_k, \varvec{S}_k)\) pairs for training \(M_{F}\) and \(M_{B}\).

3.3.2 Training for \(M_F\)

We train the forward mapping function \(M_{F}\) using a two-step process comprising Laplacian feature reconstruction training and size finetuning. In the Laplacian feature reconstruction training step, we use the following loss function involving the feature vector reconstruction loss (\(\mathcal {L}_{\varvec{P}_\text {recon}}\)) and the topology loss (\(\mathcal {L}_{\varvec{T}}\)):

$$\begin{aligned} \mathcal {L}_{M_{F}} = \mathcal {L}_{\varvec{P}_\text {recon}} + \mathcal {L}_{\varvec{T}}, \end{aligned}$$
(2)

where \(\mathcal {L}_{\varvec{P}_\text {recon}}\) and \(\mathcal {L}_{\varvec{T}}\) are defined as follows:

$$\begin{aligned} \mathcal {L}_{\varvec{P}_\text {recon}}&= \sum _{i=1}^{{N}} ||{Dec}_{P_i}(\varvec{P}'_i) - {Dec}_{P_i}(\varvec{P}_i)||_2^2 \end{aligned}$$
(3)
$$\begin{aligned} \mathcal {L}_{\varvec{T}}&= \sum _{i=1}^{{N}} \Vert \varvec{T'} - \varvec{T}\Vert _2^2 \end{aligned}$$
(4)

Here, \((\varvec{P'}, \varvec{T'})\) are the shape attributes predicted by \(M_{F}\), \(Dec_{P_i}(\cdot )\) denotes the pretrained PartVAE decoder of the i-th part, and N denotes the number of parts. In the size finetuning step, we replace the reconstructed Laplacian feature vectors with vertex coordinates. As shown in Fig. 3, we observed that the Laplacian feature vectors are sensitive to local differences but insensitive to global shape differences. The loss function in this step can be written as:

$$\begin{aligned} \hat{\mathcal {L}}_{M_{F}}&= \mathcal {L}_{\varvec{V}_\text {recon}} + \mathcal {L}_{\varvec{T}} \end{aligned}$$
(5)
$$\begin{aligned} \mathcal {L}_{\varvec{V}_\text {recon}}&= \sum _{i=1}^{{N}} ||\mathcal {T}({Dec}_{P_i}(\varvec{P}'_i)) - \mathcal {T}({Dec}_{P_i}(\varvec{P}_i))||_2 \end{aligned}$$
(6)

where \(\mathcal {T}\) transforms a vertex Laplacian feature vector into its vertex coordinates according to the steps described in [52].
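A per-batch loss for the two training steps of \(M_F\) could be sketched as follows; the assumed input layout (per-part code lists and frozen per-part decoders) is illustrative.

```python
import torch


def mf_loss(P_pred, T_pred, P_gt, T_gt, part_decoders, to_vertices=None):
    """Loss sketch for training M_F (Eqs. 2-6); layout of inputs is assumed.

    P_pred, P_gt   -- per-part PartVAE codes, indexable by part (n_parts entries of 128-d)
    T_pred, T_gt   -- topology vectors
    part_decoders  -- pretrained, frozen PartVAE decoders Dec_{P_i}, one per part
    to_vertices    -- optional transform T from Laplacian features to vertex coordinates;
                      when given, the loss switches to the size-finetuning form (Eqs. 5-6).
    """
    recon = 0.0
    for i, dec in enumerate(part_decoders):
        f_pred = dec(P_pred[i])               # decoded per-vertex Laplacian features
        f_gt = dec(P_gt[i])
        if to_vertices is None:
            recon = recon + ((f_pred - f_gt) ** 2).sum()           # Eq. (3)
        else:
            v_pred, v_gt = to_vertices(f_pred), to_vertices(f_gt)
            recon = recon + torch.norm(v_pred - v_gt, p=2)         # Eq. (6)
    topo = ((T_pred - T_gt) ** 2).sum()                            # Eq. (4)
    return recon + topo
```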

3.3.3 Training for \(M_B\)

Our backward mapping function \(M_{B}\) minimizes the following loss:

$$\begin{aligned} \mathcal {L_{\textit{M}_{\textit{B}}}} = \mathcal {L_{\textbf{w}_\text {recon}}} + \lambda _1 \mathcal {L_{\textbf{w}_\text {reg}}} \end{aligned}$$
(7)

where \(\lambda _1\) is the weight of the \(l_2\) norm regularization term of the image latent code \(w'\) (the same weight used in image inversion in Eq. 1), and \(\mathcal {L_{\textbf{w}_\text {recon}}}\) is defined as:

$$\begin{aligned} \mathcal {L_{\textbf{w}_\text {recon}}} = \mathcal {L}_{LPIPS}(G({w}'), G({w})) \end{aligned}$$
(8)

where G is the pretrained StyleGAN-ADA generator, \(w'\) is the mapped image latent code of \((\varvec{P}, \varvec{T})\), and \(\mathcal {L}_{LPIPS}(\cdot )\) is the perceptual loss function [54].

3.3.4 Finetuning for \(M_F\) and \(M_B\)

After training \(M_{F}\) and \(M_{B}\) separately, we finetune them together using the following loss function:

$$\begin{aligned} \mathcal {L_{\text {finetune}}} = \mathcal {L}_{\varvec{T}} + \mathcal {L_{\textbf{w}_\text {recon}}} + \lambda _1 \mathcal {L_{\textbf{w}_\text {reg}}} + \lambda _2 \mathcal {L}_{\varvec{P}_\text {reg}} \end{aligned}$$
(9)

where \(\lambda _2\) is the weight of a shape attribute regularization term which can be written as:

$$\begin{aligned} \mathcal {L}_{\varvec{P}_\text {reg}} = ||\textit{M}_{\textit{F}}( w ) - \textit{M}_{\textit{F}_\text {freeze}}( w )||_2, \end{aligned}$$
(10)

where \(M_{F_\text {freeze}}\) is the frozen forward mapping network before finetuning. Note that this loss replaces the Laplacian feature reconstruction loss with the shape attribute regularization term \(\mathcal {L}_{\varvec{P}_\text {reg}}\) and therefore differs from the loss functions used to train \(M_F\) and \(M_B\) separately. In Fig. 4, we show the reconstructed shapes based on PartVAE latent codes predicted with and without \(\mathcal {L}_{\varvec{P}_\text {reg}}\); the term prevents the PartVAE latent codes from deviating substantially from those of reasonable 3D shapes. \(\mathcal {L}_{\varvec{P}_\text {reg}}\) has a significant impact because it helps align the latent spaces learned by the forward and backward mapping functions, which are otherwise often not well aligned. By adding this term, we encourage the finetuning process to focus on the latent space around the learned forward mapping function, which leads to better performance.
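The frozen-copy regularizer of Eq. (10) can be sketched as follows; the helper name is ours.

```python
import copy

import torch


def make_p_reg(M_F):
    """Build the shape-attribute regularizer of Eq. (10): a frozen copy of the forward
    mapping network anchors its predictions near their pre-finetuning values (sketch)."""
    M_F_freeze = copy.deepcopy(M_F).eval()
    for p in M_F_freeze.parameters():
        p.requires_grad_(False)

    def p_reg(w):
        return torch.norm(M_F(w) - M_F_freeze(w), p=2)

    return p_reg
```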

Fig. 4

a Input image I; b is the mapped shape of \(w_I\) without \(\mathcal {L}_{\varvec{P}_\text {reg}}\); c is the mapped shape of \(w_I\) with \(\mathcal {L}_{\varvec{P}_\text {reg}}\); d is the target shape rendered in the input image

Fig. 5

Part replacement editing process based on our pipeline

4 Shape part manipulation

4.1 Part replacement

The first manipulation task is to replace parts of the input shape. Given the source image \(I_{\text {source}}\), the user aims to replace one part (e.g., the chair back) in \(I_{\text {source}}\) with the corresponding part in the target image \(I_{\text {target}}\). The part replacement procedure of our mapping framework is illustrated in Fig. 5. First, we obtain the image latent codes (\(w_{\text {source}}\) and \(w_{\text {target}}\)) through image inversion, and then we use the pretrained \(M_F\) to obtain the shape attributes \(\varvec{S}_{\text {source}}\) and \(\varvec{S}_{\text {target}}\) of the shapes in both input images. A user selects which part (e.g., back, seat, or leg) of the shape in \(I_{\text {source}}\) to replace. The shape attribute representing the selected part is then replaced with the corresponding target shape attribute, e.g., \(\varvec{S}'_{\text {source}}=\{\varvec{P}_0^{\text {source}}, \varvec{P}_1^{\text {target}},\ldots , \varvec{P}_{n-1}^{\text {source}}\}\). Finally, we synthesize the edited image \({I}'= G(M_B({\varvec{S}}', \varvec{V}))\), where \(\varvec{V}\) is the original viewing angle vector. The new shape in the manipulated image contains the part selected from \(I_{\text {target}}\), while the non-selected parts remain as close as possible to the original parts in \(I_{\text {source}}\).
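A sketch of this replacement procedure, with all component names taken from Sect. 3, is given below; the function itself is illustrative.

```python
def replace_part(w_source, w_target, part_idx, M_F, M_B, M_V, G):
    """Part-replacement sketch (Sect. 4.1, Fig. 5); component names follow the text."""
    P_src, T_src = M_F(w_source)        # shape attributes of the source image
    P_tgt, _ = M_F(w_target)            # shape attributes of the target image
    P_new = list(P_src)
    P_new[part_idx] = P_tgt[part_idx]   # swap the selected part's PartVAE code
    v = M_V(w_source)                   # keep the original viewing angle
    w_edit = M_B((P_new, T_src), v)     # backward map the edited attributes
    return G(w_edit)                    # synthesized edited image I'
```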

4.2 Part resizing

The second manipulation task is part resizing, in which a user directly resizes a selected part in the input image. We first invert an input image I to obtain its GAN latent code \(w_I\) and map \(w_I\) to the shape attribute \(\varvec{S}\) using \(M_F\). The user can then resize the selected part by following a certain trajectory in the latent space and obtain the resulting image \(I_{\text {resize}}\):

$$\begin{aligned} I_{\text {resize}} = G(w_{I}+\mathcal {F}_{r}(w_{I}, r^{\varvec{P}})), \end{aligned}$$
(11)

where \({r}^{\varvec{P}}\) is the trajectory of the PartVAE latent code that represents the desired resizing result, and \(\mathcal {F}_{r}\) is a trajectory finetuner function that refines a PartVAE code trajectory into a GAN latent code trajectory.

4.2.1 Resize trajectory

Trajectories that fit the desired resize manipulation in \(\mathcal {S}\) and \(\mathcal {W}\) are obtained using the following procedure. Given a shape \(s_i\) in our dataset, we obtain its geometric attribute by \(\varvec{P}_i={\text {Enc}}_{P}(\mathcal {T}^{-1}(s_i))\), where \({\text {Enc}}_{P}\) is a pretrained PartVAE encoder. We then apply a specific resizing operation to \(s_i\) to obtain the resized shape \(\hat{s}_i = \mathcal {R}(s_i)\) and its geometric attribute \(\hat{\varvec{P}}_i={\text {Enc}}_{P}(\mathcal {T}^{-1}(\hat{s}_i))\). Thus, we obtain a PartVAE latent trajectory for this resize manipulation of \(s_i\) by \(r_i^{\varvec{P}}=\hat{\varvec{P}}_i-\varvec{P}_i\). We apply the same resize manipulation to all shapes in our dataset and average all PartVAE latent trajectories to obtain a general trajectory \(r^{\varvec{P}}=(1/N)\sum _{i=1}^{N} r_i^{\varvec{P}}\) representing the resizing manipulation.

When applying \(r^{\varvec{P}}\) directly, we observe that the resized images often lose some details, thus impairing shape identity (Fig. 7). For better shape identity after part resizing, we introduce a GAN space trajectory finetuner implemented as a four-layer MLP. The main function of this finetuner is to transform trajectories from \(\mathcal {S}\) to \(\mathcal {W}\). The finetuner inputs are the GAN latent code of the input image and a PartVAE latent trajectory of the target part, and the output is a trajectory in \(\mathcal {W}\). The training data of this trajectory finetuner are paired trajectories \((r^{\varvec{P}}, r_i^{\varvec{W}})\). To collect them, we first add \(r^{\varvec{P}}\) to the PartVAE latent codes of all training shapes and obtain the corresponding rendered image latent codes \(\hat{w}_i\) by optimizing Eq. 1. For shape i, we obtain the GAN latent space trajectory \(r_i^{\varvec{W}}\) corresponding to \(r^{\varvec{P}}\) by \(r_i^{\varvec{W}} = \hat{w}_i - w_i\). We train the trajectory finetuner by minimizing the following loss function:

$$\begin{aligned} \mathcal {L}_r=\Vert \mathcal {F}_{r}(w_i, r^{\varvec{P}})- r_i^{\varvec{W}} \Vert ^2_2. \end{aligned}$$
(12)
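The trajectory computation and the resizing edit of Eq. (11) can be sketched as follows; `Enc_P`, `T_inv`, `resize`, and `F_r` stand for the components \({\text {Enc}}_{P}\), \(\mathcal {T}^{-1}\), \(\mathcal {R}\), and \(\mathcal {F}_{r}\) described above, and the functions themselves are illustrative.

```python
import torch


def average_resize_trajectory(shapes, Enc_P, T_inv, resize):
    """Compute the general PartVAE trajectory r^P for one resize operation (Sect. 4.2.1)."""
    trajs = []
    for s in shapes:
        P = Enc_P(T_inv(s))               # geometric attribute of the original shape
        P_hat = Enc_P(T_inv(resize(s)))   # geometric attribute of the resized shape
        trajs.append(P_hat - P)           # per-shape latent trajectory r_i^P
    return torch.stack(trajs).mean(dim=0) # general trajectory r^P


def resize_image(w_I, r_P, F_r, G):
    """Apply the resize edit of Eq. (11) using the GAN-space trajectory finetuner F_r."""
    return G(w_I + F_r(w_I, r_P))
```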

We use the shapes in the datasets described in Sect. 3.3 as the original shapes. For each part, we resize every shape by adding three scaled trajectories (\(-0.5\,r^{\varvec{P}}\), \(+0.5\,r^{\varvec{P}}\), and \(+1.0\,r^{\varvec{P}}\)) to its shape attribute. This yields 10,125 (\(3375\times 3\)) shapes for each part of the chair and guitar categories, and 30,375 (\(10{,}125 \times 3\) parts) shapes in total per category. There are 7,500 (\(2500 \times 3\)) shapes for each part of the car and cup categories, for a total of 22,500 (\(7500 \times 3\) parts) and 15,000 (\(7500 \times 2\) parts) shapes, respectively.

4.3 Shape orientation manipulation

The third editing task is shape orientation manipulation. To achieve this, we first train a shape orientation prediction network (\(M_{V}\)) to predict the orientation of an image I from its GAN latent code \(w_{I}\), i.e., \(\varvec{V}=M_{V}(w_{I})\). The shape orientation prediction network is an eight-layer MLP trained with the cross-entropy loss. For each category, we use the rendered images described in Sect. 3.3 and collect 36,000 paired training samples \((w, v)\) from 3000 shapes in 12 different orientations.

To manipulate the orientation of input image I, we obtain the shape attribute \(\varvec{S}\) of the shape in the input image using \(M_F\). We then obtain the edited image with the manipulated orientation vector: \({I}'=G(M_B(\varvec{S}, \varvec{V}'))\).
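A sketch of the orientation edit follows; the 12-way one-hot encoding matches the \(30^{\circ }\) sampling described in Sect. 3.3, and the function names are illustrative.

```python
import torch


def change_orientation(w_I, target_angle_idx, M_F, M_B, G, n_views=12):
    """Orientation-manipulation sketch (Sect. 4.3): re-synthesize under a new view one-hot."""
    S = M_F(w_I)                         # shape attributes of the input image
    v_new = torch.zeros(n_views)
    v_new[target_angle_idx] = 1.0        # one-hot vector for the desired 30-degree bin
    return G(M_B(S, v_new))              # I' = G(M_B(S, V'))
```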

5 Experiment and results

5.1 Implementation details

For each man-made object category, we trained the StyleGAN-ADA generator with a batch size of 64 and a learning rate of 0.0025. Both the forward and backward mapping networks were trained using the Adam [55] optimizer for 3000 epochs with a batch size of 64. All networks were trained on a Tesla V100 GPU. We implemented our pipeline using PyTorch [56]. The forward mapping network was trained for 0.5 to 3 days, depending on the number of part labels. The backward mapping network was trained for 20 hours, and the forward and backward mapping networks were finetuned together for 10 hours.

Baseline disentangled network. We use a conditional variational autoencoder (cVAE) as the baseline disentangled method (the architecture is provided in the supplemental material) and compare our results against this baseline quantitatively on image shape reconstruction and viewing angle manipulation. For each shape, we include three factors: part identity, part size, and viewing angle. All these factors are modeled as discrete variables and represented as one-hot vectors during training and testing.

5.2 Image shape reconstruction

The key contribution of the proposed framework is the mapping functions between the GAN latent space and the shape attribute latent space. We qualitatively and quantitatively evaluated these mapping functions through image shape reconstruction.

5.2.1 With/without size finetuning for \(M_F\)

As described in Sect. 3.3, after the Laplacian feature reconstruction training, we performed an additional size finetuning for \(M_F\) with vertex coordinate loss for a better reconstruction. To test the effectiveness of the size finetuning, we evaluated the reconstruction quality of the front view of the testing images described in Sect. 3.3. We sampled 4000 points on each shape and calculated the bidirectional Chamfer distance as our shape reconstruction error metric, and we used perceptual loss as the image reconstruction error metric. Both the image reconstruction and the shape reconstruction errors were lower when the network was trained with size finetuning (Table 1).

Table 1 Comparison of the mean shape reconstruction errors \((E_{s})\) and mean image reconstruction errors \((E_{i})\) obtained with and without size finetuning, and with the baseline conditional VAE (cVAE) method \(\mathcal {B}\). We do not report the mean shape reconstruction error for the baseline method because it does not reconstruct an explicit shape

5.2.2 Finetuning for \(M_F\) and \(M_B\)

After separately training \(M_F\) and \(M_B\), we performed end-to-end finetuning to improve the reconstruction results. We compared the full and non-finetuned versions of our method; the full version includes the shape attribute regularizer \(\mathcal {L}_{\varvec{P}_\text {reg}}\) to prevent the shape from collapsing. We tested on about 300 images in each category and obtained the mean perceptual losses of the finetuned and non-finetuned networks. Through this step, the image reconstruction errors for the chair, car, cup, and guitar categories were reduced by \(28\%\), \(37\%\), \(27\%\), and \(32\%\), respectively.

5.3 Shape part manipulation results

5.3.1 Part replacement results

We randomly picked 12 different chair images and exchanged their parts. We applied several replacement patterns to all possible combinations of the picked 12 images and produced edited images for each category. We show some replacement results in Fig. 6. Detailed results are available in the supplementary material.

Fig. 6

Results of part replacement. The images were produced from the attributes of shapes 1 and 2

Fig. 7

Part resizing results for different categories. a Input image; b resized image obtained directly using \(P+r^{\varvec{P}}\); c resized image obtained using the GAN space trajectory finetuner; d inverted result of the ground truth; e rendered ground truth

Table 2 Quantitative results for the part resizing manipulation. Comparison of mean perceptual losses of results obtained using shape attribute trajectory (\(r^{\varvec{P}}\)) and the GAN space trajectory finetuner (\(F_{r}\))
Fig. 8

Shape orientation manipulation results. Our method can synthesize images of the shape from different viewing angles

5.3.2 Part resizing results

We trained separate GAN latent trajectory finetuners for each part using the resizing dataset discussed in Sect. 4.2. We compared the test results obtained by two methods: the shape attribute trajectory (\(r^{\varvec{P}}\)) and the GAN space trajectory finetuner (\(\mathcal {F}_{r}\)). The comparison results and mean perceptual losses are shown in Fig. 7 and Table 2, respectively. The manipulation results obtained from \(\mathcal {F}_{r}\) show lower perceptual loss and match the desired resize manipulation better than those obtained from \(r^{\varvec{P}}\). Moreover, we applied multiple trajectory finetuners trained on different parts of the same image to resize multiple parts simultaneously. The failure case (last row in Fig. 7) occurs because the target resized image lies outside the GAN latent space, so the finetuner could not find a good trajectory to fit the target image.

Table 3 Shape orientation manipulation quantitative evaluation

5.3.3 Shape orientation manipulation results

Figure 8 shows the shape orientation manipulation results. By manipulating the one-hot shape orientation vector of a given image, our method can produce images of the shape at different orientations. We tested the images described in Sect. 3.3 against the baseline methods; for each shape, each method synthesizes images from the 12 orientations described in Sect. 3.3. Table 3 reports the mean perceptual losses across different orientations. Our method obtains lower perceptual losses than both the baseline method (cVAE) and pixelNeRF [21], for which we trained a 1-view pixelNeRF using our training data. This result suggests that the explicit shape attribute latent space helps to build better geometries than random latent vectors and the implicit geometry learned by a radiance-field-based method [21]. The shape identities of the manipulated results (Fig. 8) are well maintained. Figure 9 shows visual comparisons between our results and those of pixelNeRF [21]; pixelNeRF often fails to reconstruct the geometric details of shape parts, and thus the shape identities are not maintained.

Fig. 9

Comparison with pixelNeRF [21]. Our method synthesizes shapes with more details, while pixelNeRF only synthesizes blurry shapes from the same input orientation

Fig. 10

The first row shows the manipulated results; the second row shows the shape images inferred through shape-specific finetuning. a Input real image; b shape and image reconstructed using our pipeline; c seat replacement result; d resizing result with a wider back and e a wider seat

5.4 Real image results

We tested our pipeline for shape part manipulation of real images. To infer the shape attributes of an object in an input real image I, we first colorized the shape region in I so that it had the same appearance as our dataset. We then projected the colorized image \(\tilde{I}\) into the GAN latent space to obtain its GAN latent code \(w_{\tilde{I}}\). Because the shape in \(\tilde{I}\) is often outside the domain covered by our training dataset, our pipeline yielded unsatisfactory results (the “without shape-specific finetune” row in Fig. 10). To address this problem, we designed a shape-specific finetuning process: we identify the best forward and backward mapping network parameters \(\theta _{M_F}'\) and \(\theta _{M_B}'\) for \(\tilde{I}\) by optimizing the following function:

$$\begin{aligned} \mathcal {L}_{LPIPS}(I', I)+\lambda _3 \Vert \theta _{M_F}' - \theta _{M_F}\Vert _2 + \lambda _4 \Vert \theta _{M_B}' - \theta _{M_B}\Vert _2, \end{aligned}$$
(13)

where \(I'\) is the image re-synthesized from \(w_{\tilde{I}}\) through the optimized mapping functions, \(\theta _{M_F}\) and \(\theta _{M_B}\) are the parameters of the pretrained forward and backward mapping functions, and \(\lambda _3\) and \(\lambda _4\) are the weights of the terms measuring the distances between the pretrained and optimized mapping functions. Figure 10 shows the manipulated results for a real image obtained through the shape-specific finetuning process. After obtaining the manipulated image, we warped the input texture in the source real image using thin plate spline (TPS) warping. We identified the object contours in the input image and the manipulated images using [57], and sampled points on the contours as the control points for TPS warping.
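A sketch of this shape-specific finetuning loop is given below; the step count, learning rate, and \(\lambda _3\)/\(\lambda _4\) values are assumptions, `percep` denotes the LPIPS loss, and the viewing angle vector v is assumed to come from the pretrained \(M_V\).

```python
import copy

import torch


def shape_specific_finetune(I_tilde, w, v, G, M_F, M_B, percep,
                            steps=200, lr=1e-4, lam3=1.0, lam4=1.0):
    """Shape-specific finetuning sketch for Eq. (13): adapt copies of M_F and M_B to one
    colorized real image while keeping their weights close to the pretrained values."""
    M_F_ft, M_B_ft = copy.deepcopy(M_F), copy.deepcopy(M_B)
    theta_F0 = [p.detach().clone() for p in M_F.parameters()]
    theta_B0 = [p.detach().clone() for p in M_B.parameters()]
    opt = torch.optim.Adam(list(M_F_ft.parameters()) + list(M_B_ft.parameters()), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        S = M_F_ft(w)                              # mapped shape attributes of the real image
        I_hat = G(M_B_ft(S, v))                    # re-synthesized image I'
        reg_F = sum(torch.norm(p - p0) for p, p0 in zip(M_F_ft.parameters(), theta_F0))
        reg_B = sum(torch.norm(p - p0) for p, p0 in zip(M_B_ft.parameters(), theta_B0))
        loss = percep(I_hat, I_tilde).mean() + lam3 * reg_F + lam4 * reg_B
        loss.backward()
        opt.step()
    return M_F_ft, M_B_ft
```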

6 Conclusion and limitation

In this paper, we propose a framework that bridges the image latent space and the 3D shape attribute space using shape-consistent mapping functions. We show that these mapping functions enable image-based shape part manipulation tasks such as part replacement, part resizing, and shape orientation manipulation without the need for any 3D workflow.

Fixed shape attribute space. Our trained mapping functions assume a fixed number of parts, inherited from SDM-NET [31]. In the future, we plan to explore a more flexible shape attribute space.

Gap between synthesized and real images. Our framework focuses on the geometry of the input synthesized image. To manipulate the shape of a real image, we need to infer the appearance of the deformed textures to achieve realistic manipulation results. We plan to learn mapping functions between realistic images and rendered images using an image translation network [58].