StylePart: Image-based Shape Part Manipulation

Due to a lack of image-based "part controllers", shape manipulation of man-made shape images, such as resizing the backrest of a chair or replacing a cup handle, is not intuitive. To tackle this problem, we present StylePart, a framework that enables direct shape manipulation of an image by leveraging generative models of both images and 3D shapes. Our key contribution is a shape-consistent latent mapping function that connects the image generative latent space and the 3D man-made shape attribute latent space. Our method "forwardly maps" the image content to its corresponding 3D shape attributes, where the shape part can be easily manipulated. The attribute codes of the manipulated 3D shape are then "backwardly mapped" to the image latent code to obtain the final manipulated image. We demonstrate our approach through various manipulation tasks, including part replacement, part resizing, and viewpoint manipulation, and evaluate its effectiveness through extensive ablation studies.


Introduction
Manipulating 3D objects based on a single photo is a challenging task. It involves various complex subtasks, e.g., extracting the target shape from the background; estimating the 3D pose, shape, and materials of the object; estimating the lighting conditions in the scene through image observation; and extracting controllers to manipulate the estimated object properties. Recently, several advanced deep learning-based methods have been developed to perform these subtasks, including object segmentation [HGDG17]; object pose [SSH20], shape [FFBB21, GG19, LSG*20], and material estimation [LCY*17]; and lighting estimation [GSH*19, ZZLS21].
Moreover, generative adversarial networks (GANs) have opened up a new high-fidelity image generation paradigm. For example, because of its disentangled style space, StyleGAN [KLA19] can produce high-resolution facial images with unmatched photorealism and support stylized manipulation. A user can transform the generated outputs by tweaking the latent code [SGTZ20, SZ21, HHLP20, JCI20]. A user can also edit a natural image by projecting it into the GAN image latent space, finding a latent code that reconstructs the input image, and then modifying that code [ZKSE16, HSSI21].
Semantic 3D controllers have been introduced for semantic parameter control over images. For example, for generated [TEB*20b] and natural [TEB*20a, TZK*17] human face images, the 3D morphable face model (3DMM) has been used to control facial shape, facial expression, head orientation, and scene illumination. However, these methods apply only to portrait face images.
In this paper, we present StylePart, which investigates how to achieve part-based manipulation of man-made objects in a single image. The key insight of our method is to augment the image latent space with a 3D generative shape representation that is controllable and carries semantic information. We propose a shape-consistent mapping function that connects the image generative latent space and the latent space of 3D man-made shape attributes. The shape-consistent mapping function is composed of forward and backward mapping functions. We "forwardly map" the input image from the image latent code space to the shape attribute space, where the shape can be easily manipulated. The manipulated shape is then mapped back to the image latent code space and synthesized using a pretrained StyleGAN. We also propose a novel training strategy that guarantees the soundness of the mapped 3D shape structure.
Note that we focus on the subtask of extracting controllers to manipulate the shape only, rather than proposing a full image manipulation system that tackles all subtasks together. We therefore use images with a simple rendering style, without complicated lighting conditions and material properties, to demonstrate our results. We evaluate our method through shape reconstruction tests on four man-made object categories (chair, cup, car, and guitar). We also present several identity-preserving results for three shape-part manipulation tasks: part replacement, part resizing, and viewpoint manipulation.

Image-based shape reconstruction and editing
Three-dimensional modeling based on a single photo has long been a challenge in computer graphics and computer vision. Several studies have investigated shape inference from multiple images [DTM96, SSS06] and from single photos for different shape representations, such as voxels [WWX*17, YYY*16], point clouds [SFG17], meshes [GG19, MDG*20, LSG*20], and simple primitives [CZS*13]. With advances in deep learning methods, the quality of reconstructed shapes and images has improved tremendously; however, the reconstructions are usually not semantically controllable.
Chen et al. [CZS*13] proposed 3-Sweep, an interactive method for shape extraction and manipulation in a photo. A user can create 3D primitives (cylinders and cuboids) using 3-Sweep and manipulate the photo content using the extracted primitives. The method of [KSES14] enables users to manually align a publicly available 3D model to guide the completion of geometry and light estimation. Unlike previous methods, our method leverages recent advances in part-based generative shape representations [GYW*19, MGY*19] to automatically infer shape attributes. Moreover, the shape parts are more complex than simple cylinders and cuboids.

GAN inversion and latent space manipulation
GAN inversion is required to edit a real image through latent manipulation. GAN inversion identifies the latent vector from which the generator can best replicate the input image. Inversion methods can typically be divided into optimization-based and encoder-based methods. Optimization-based methods directly optimize the latent code using a single sample [AQW19, AQW20, CB18, HZZ*20], whereas encoder-based methods train an encoder over a large number of samples [GTN*20, TAN*21, APCO21]. Some recent works have augmented the inversion process with additional semantic constraints [ZSZZ20] and additional latent codes [GSZ20]. Among these works, many have specifically considered StyleGAN inversion and investigated latent spaces with different disentanglement abilities, such as W [KLA*20], W+ [AQW19, AQW20], and style space (S) [WLS21]. In our work, we use StyleGAN-ADA [KAH*20a] as the generator and follow its original inversion process in the W space. Moreover, we use additional shape attribute spaces to facilitate semantic disentanglement and support various image-based shape manipulation tasks.
Several works have examined semantic directions in the latent spaces of pretrained GANs. Full supervision using semantic labels [GAOI19, SGTZ20] and self-supervised approaches [JCI19, PBH20, SEBM20] have been employed. Moreover, several recent studies have utilized unsupervised methods to obtain semantic directions [WP21, HHLP20, SZ21]. A series of works focused on real human facial editing [TEB*20b, TEB*20a] and non-photorealistic faces [WCH*21]; they utilized a prior in the form of a 3D morphable face model.

Method
We aim to enable users to directly edit the structure of a man-made shape in an image. We first describe the background of the 3D shape representation method, which allows for tight semantic control of a 3D man-made shape. We then describe a new neural network architecture that maps latent vectors between the image and 3D shape domains, and highlight the different shape part manipulation methods built on this architecture.

3D shape attribute representation
We adopt the structured deformable mesh generative network (SDM-NET) [GYW*19] to represent man-made shape attributes. The network allows for tight semantic control of different shape parts. Moreover, it generates a spatial arrangement of closed, deformable mesh parts, which represents the global part structure of a shape collection, e.g., chairs, tables, and airplanes. In SDM-NET, a complete man-made shape is generated using a two-level architecture comprising a part variational autoencoder (PartVAE) and a structured parts VAE (SP-VAE). PartVAE learns the deformation of a single part using its Laplacian feature vector, extracted through the method established in [GLY*19], and SP-VAE jointly learns the deformation of all parts of a shape and the structural relationships between them. In this work, given an input shape s of category c with n_c parts, we represent its shape attributes S = (P, T) using both a geometric attribute P and a topology attribute T (defined below).

Shape-consistent mapping framework
At the core of our image-based shape part manipulation pipeline (shown in Figure 2) is a cross-domain mapping structure between the image latent space W and the shape attribute space S. Given an input image I containing a man-made object, the image inversion method is first applied to optimize a GAN latent code w_I that best reconstructs the image. Then, the corresponding shape attributes S are obtained using a forward shape-consistent latent mapping function M_F. The viewing angle of the input image is predicted using a pretrained viewing angle predictor M_V. From the mapped geometric attribute, topology attribute, and predicted viewing angle, we can perform image-based shape editing tasks, such as part replacement, part deformation, and viewing angle manipulation, by manipulating the attribute codes. After manipulating the attribute codes, we use the backward mapping function M_B to map the shape attributes to an image latent code w′. The final edited image is synthesized as I′ = G(w′).
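The forward-edit-backward flow above can be sketched end to end with toy stand-ins; the mapping matrices, dimensions, and "generator" below are placeholders of our own, not the paper's trained networks:

```python
import numpy as np

# Toy stand-ins for the pretrained components (all names and sizes are
# illustrative assumptions, not the paper's released models).
rng = np.random.default_rng(0)
W_f = rng.standard_normal((20, 512))   # linear stand-in for M_F
W_b = rng.standard_normal((512, 20))   # linear stand-in for M_B

def M_F(w):        # image latent -> shape attributes (P, T)
    s = W_f @ w
    return s[:16], s[16:]              # toy split: geometry vs. topology

def M_B(P, T, v):  # shape attributes + viewing angle -> image latent
    return W_b @ np.concatenate([P, T]) + 0.01 * v.sum()

def G(w):          # stand-in generator: the "image" is just the latent
    return w

def edit(w_I, v, manipulate):
    P, T = M_F(w_I)                    # forward map
    P, T = manipulate(P, T)            # edit the shape attribute codes
    return G(M_B(P, T, v))             # backward map + synthesis

w_I = rng.standard_normal(512)
v = np.eye(12)[0]                      # one-hot viewing angle (12 bins)
out = edit(w_I, v, lambda P, T: (P * 1.1, T))  # e.g., scale a part code
print(out.shape)  # (512,)
```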

Image inversion
To obtain the latent code of the input image I, we optimize the following objective:

w_I = argmin_w L_LPIPS(G(w; θ), I) + λ_1 L_w_reg,   (1)

where G is a pretrained StyleGAN-ADA [KAH*20b] generator with weights θ, L_LPIPS denotes the perceptual loss, and L_w_reg = ||w||_2^2 denotes the latent code regularization loss. We introduced L_w_reg because we observed that multiple latent codes can synthesize the same image; the regularizer yields a unique latent code for each input image.
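A minimal sketch of this inversion objective, with a frozen linear toy "generator" standing in for StyleGAN-ADA and MSE standing in for L_LPIPS (both substitutions are our assumptions):

```python
import numpy as np

# Frozen toy generator: a fixed linear map from latent space to "pixels".
rng = np.random.default_rng(0)
A = rng.standard_normal((256, 512)) / np.sqrt(512)
b = rng.standard_normal(256)
G = lambda w: A @ w + b

target = G(rng.standard_normal(512))   # "image" to invert
w = np.zeros(512)
lam, lr = 1e-3, 0.1                    # regularizer weight, step size

for _ in range(500):
    resid = G(w) - target
    # gradient of ||G(w) - I||^2 + lam * ||w||^2
    grad = 2 * A.T @ resid + 2 * lam * w
    w -= lr * grad

print(np.mean((G(w) - target) ** 2) < 1e-2)  # True: w reconstructs the target
```

The small latent regularizer plays the same role as L_w_reg here: among the many latents that reproduce the target, it selects a unique, small-norm one.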

From W to S
Given a GAN latent code w_I, inverted from an input image I, we obtain the corresponding shape attribute code S = (P, T) using the forward mapping network, (P, T) = M_F(w_I), where the 3D shape generated by (P, T) best fits the target man-made shape encoded in the image latent code w_I.

From S to W
Given the shape attribute S and the one-hot vector of a viewing angle v, the backward mapping function predicts a GAN latent code w′ = M_B(S, v), such that the synthesized image I′ = G(w′) best matches the image containing the target shape described by S at viewing angle v.
Our mapping functions M_F and M_B are realized as two eight-layer MLP networks (see the supplemental material for the detailed architecture).
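An eight-layer MLP mapper pair could look like the following sketch; the hidden widths, activation, and attribute dimensions are our own guesses (the paper's exact architecture is in its supplemental material):

```python
import torch
import torch.nn as nn

# Sketch of an eight-layer MLP mapper (widths and activation are assumptions).
def make_mapper(in_dim, out_dim, hidden=512, layers=8):
    mods, d = [], in_dim
    for _ in range(layers - 1):
        mods += [nn.Linear(d, hidden), nn.LeakyReLU(0.2)]
        d = hidden
    mods.append(nn.Linear(d, out_dim))
    return nn.Sequential(*mods)

w_dim = 512                    # StyleGAN W-space latent size
s_dim = 128 * 3 + 79           # e.g., 3 PartVAE codes + a topology vector
                               # (dimension chosen for illustration only)
M_F = make_mapper(w_dim, s_dim)            # W -> S
M_B = make_mapper(s_dim + 12, w_dim)       # (S, one-hot view) -> W

w = torch.randn(4, w_dim)
S = M_F(w)
v = torch.eye(12)[:4]                      # one-hot viewing angles
w_back = M_B(torch.cat([S, v], dim=1))
print(S.shape, w_back.shape)  # torch.Size([4, 463]) torch.Size([4, 512])
```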

Data preparation
To train our shape-consistent mapping function, both the image-based latent code and the shape attribute code for each man-made object shape are required. In this paper, we use synthetic datasets of four categories (chair, guitar, car, and cup). A dataset contains N^M shapes, formed by interchanging the M parts of N shapes. We render these shapes from different viewing angles, sampled at 30° intervals on the yaw axis, to obtain paired shapes and images. For the chair and guitar categories, N = 15 and M = 3; for the cup and car categories, N = 50 and M = 2. There are 3,375 shapes and 40,500 images for each of the chair and guitar categories, and 2,500 shapes and 30,000 images for each of the car and cup categories. We split the synthetic datasets for each category into 80% training data and 20% testing data.

For shape k, we first prepare the shape attribute code S_k. Each part is represented by a feature matrix f ∈ R^{9×V} that describes the per-vertex deformation of a template cube with V = 3,752 vertices. Let n_c denote the number of parts for category c, and z_part = 128 the dimension of a PartVAE latent code. We embed the feature matrix of part i (f_i) into a pretrained PartVAE and obtain P_i = Enc_Pi(f_i). We prepare the same topology vector T ∈ R^{2n+73} for all shapes in the same category. Finally, for each shape, we concatenate the PartVAE codes of all parts and the topology vector into the final shape attribute vector S = (P_0, P_1, ..., P_{n−1}, T).
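The dataset combinatorics can be checked directly; the concrete part grouping (e.g., back/seat/leg for chairs) is our illustrative assumption:

```python
from itertools import product

# Chairs interchange M = 3 part groups across N = 15 base shapes,
# giving N**M combined shapes, each rendered from 12 yaw angles.
N, M = 15, 3
base_shapes = range(N)
combos = list(product(base_shapes, repeat=M))  # one part choice per slot
print(len(combos))                # 3375 shapes
print(len(combos) * 12)           # 40500 images at 12 yaw angles (30° apart)
```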
Next, we prepare the image-based latent code for shape k. We render all shapes from 12 viewing angles and use these images to train a StyleGAN2 with adaptive discriminator augmentation [KAH*20b], conditioned on the image's viewing angle. We project the rendered image containing shape k into the pretrained conditional StyleGAN latent space to obtain the corresponding latent code w_k ∈ R^512. We collect all (w_k, S_k) pairs for training both M_F and M_B.

Training for M_F
We train the forward mapping function M_F using a two-step process comprising Laplacian feature reconstruction training and size finetuning. In the Laplacian feature reconstruction training step, we use the following loss function:

L_recon = L_Precon + L_T,   (2)

where L_Precon and L_T are defined as follows:

L_Precon = Σ_{i=0}^{N−1} ||Dec_Pi(P′_i) − f_i||_2^2,   L_T = ||T′ − T||_2^2.   (3)

Here, (P′, T′) are the shape attributes predicted by M_F, Dec_Pi(·) denotes the pretrained PartVAE decoder of the i-th part, and N denotes the number of parts. In the second step, we replace the reconstructed feature vectors from Laplacian features with vertex coordinates. As shown in Figure 3, we observed that Laplacian feature vectors are sensitive to local differences but insensitive to global shape differences. The loss function in this step can be written as:

L_size = Σ_{i=0}^{N−1} ||T(Dec_Pi(P′_i)) − T(f_i)||_2^2,

where T(·) transforms a vertex Laplacian feature vector into its vertex coordinates following the steps described in [GLY*19]. In contrast to Laplacian feature differences, vertex coordinate differences capture the global deformations.
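The structure of the first-step loss can be sketched with toy linear decoders standing in for the frozen PartVAE decoders (decoders, dimensions, and the squared-error form are our assumptions):

```python
import numpy as np

# Toy sketch of the two training signals for M_F: a Laplacian-feature
# reconstruction term through frozen per-part decoders, and a topology term.
rng = np.random.default_rng(0)
n_parts, z, feat = 3, 128, 64
decoders = [rng.standard_normal((feat, z)) / np.sqrt(z) for _ in range(n_parts)]

def loss_MF(P_pred, T_pred, f_gt, T_gt):
    # decode each predicted part code and compare with the target feature
    L_Precon = sum(np.mean((decoders[i] @ P_pred[i] - f_gt[i]) ** 2)
                   for i in range(n_parts))
    L_T = np.mean((T_pred - T_gt) ** 2)    # topology attribute term
    return L_Precon + L_T

P = [rng.standard_normal(z) for _ in range(n_parts)]
f = [decoders[i] @ P[i] for i in range(n_parts)]   # consistent "ground truth"
T = rng.standard_normal(79)
print(loss_MF(P, T, f, T))  # 0.0 when the prediction is exact
```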

Training for M_B
Our backward mapping function M_B minimizes the following loss:

L_B = L_w_recon + λ_1 ||w′||_2^2,

where λ_1 is the weight of the l2-norm regularization term of the image latent code w′ (the same weight used for image inversion in Eq. 1), and L_w_recon is defined as:

L_w_recon = L_LPIPS(G(w′), I),

where G is the pretrained StyleGAN-ADA generator, w′ is the mapped image latent code of (P, T), I is the rendered image paired with (P, T), and L_LPIPS(·) is the perceptual loss function.

Finetuning for M_F and M_B
After training M_F and M_B separately, we finetune them jointly using the following loss function:

L_ft = L_recon + λ_2 L_Preg,

where λ_2 is the weight of a shape attribute regularization term, which can be written as:

L_Preg = Σ_{i=0}^{N−1} ||P′_i − P_i||_2^2.

Here, we combine the Laplacian feature reconstruction loss with the shape attribute regularization term L_Preg; this loss differs from the loss functions used to train M_F and M_B separately. In Figure 4, we show the reconstructed shapes based on PartVAE latent codes predicted with and without L_Preg. The shape attribute regularization term prevents the PartVAE latent codes from deviating substantially from those of reasonable 3D shapes.

Shape part manipulation
We propose three types of manipulations: part replacement, part deformation, and viewing angle manipulation.

Part replacement
The first manipulation task is to replace parts of the input shape.
Given the source image I_source and a target image I_target, the user aims to replace one part of the source shape with the corresponding part of the target shape. We invert both images and forwardly map them to their shape attributes, then replace the selected source PartVAE code P_source_j with the target code P_target_j, obtaining the manipulated attribute S′ = {P_source_0, ..., P_target_j, ..., P_source_{n−1}, T}. Finally, we synthesize the edited image I′ = G(M_B(S′, V)), where V is the original viewing angle vector. The new shape in the manipulated image contains the part selected from I_target, while the non-selected parts remain as close as possible to the original parts in I_source.

Part resizing
The second manipulation task is part resizing: a user directly resizes a selected part in the input image. We first invert the input image I to obtain its GAN latent code w_I and map w_I to the shape attribute S via M_F. The user can then resize the selected part by following a certain trajectory in the latent space, obtaining the resized image I_resize:

I_resize = G(w_I + F_r(w_I, r_P)),

where r_P is the trajectory of the PartVAE latent code that represents the desired resizing result, and F_r is a trajectory finetuner function that refines a PartVAE code trajectory into a GAN latent code trajectory.

Resize trajectory
Trajectories that fit the desired resize manipulation in S and W are obtained using the following procedure. Given a shape s_i in our dataset, we obtain its geometric attribute by P_i = Enc_P(T^{-1}(s_i)), where Enc_P is a pretrained PartVAE encoder. Then we apply a specific resizing operation to s_i to obtain the resized shape ŝ_i = R(s_i) and its geometric attribute P̂_i = Enc_P(T^{-1}(ŝ_i)). Thus, we obtain a PartVAE latent trajectory for this resize manipulation of s_i by r_P_i = P̂_i − P_i. We apply the same resize manipulation to all shapes in our dataset and average all PartVAE latent trajectories to obtain a general trajectory r_P = (1/N) Σ_i r_P_i representing the resizing manipulation.

By applying r_P directly, we observe that the resized images often lose some details, impairing shape identity (Figure 9). For better identity preservation after part resizing, we introduce a GAN-space trajectory finetuner implemented as a four-layer MLP. Its main function is to transform trajectories from S to W. The finetuner takes as inputs the GAN latent code w_I of the input image and a PartVAE latent trajectory of the target part, and outputs a trajectory in W. We train this trajectory finetuner using paired trajectory data (r_P, r_W_i) by minimizing the following loss function:

L_F = Σ_i ||F_r(w_I_i, r_P) − r_W_i||_2^2.

To collect the paired trajectory data, we first add r_P to the PartVAE latent codes of all training shapes and obtain the rendered-image latent codes ŵ_i by optimizing Eq. 1. For shape i, we obtain the GAN latent space trajectory r_W_i corresponding to r_P by r_W_i = ŵ_i − w_I_i.
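The trajectory-averaging step can be sketched as follows, with a toy linear encoder standing in for PartVAE and uniform scaling standing in for the resize operation (both are our assumptions):

```python
import numpy as np

# Encode each shape before and after one fixed resize operation,
# then average the latent differences into a general trajectory r_P.
rng = np.random.default_rng(0)
Enc = rng.standard_normal((128, 64)) / 8.0   # toy PartVAE encoder

def resize(shape, scale=1.3):                # stand-in resize operation
    return shape * scale

shapes = [rng.standard_normal(64) for _ in range(50)]
trajs = []
for s in shapes:
    P = Enc @ s                              # P_i
    P_hat = Enc @ resize(s)                  # P̂_i for the resized shape
    trajs.append(P_hat - P)                  # r_P_i = P̂_i - P_i
r_P = np.mean(trajs, axis=0)                 # general resize trajectory
print(r_P.shape)  # (128,)
```

With this linear toy encoder, r_P reduces to 0.3 · Enc · mean(shapes), which makes the averaging easy to verify; the real PartVAE encoder is nonlinear.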

Viewing angle manipulation
The third editing task is viewing angle manipulation. To achieve this, we first train a viewing angle prediction network M_V to predict the viewing angle of an image I from its GAN latent code w_I, i.e., V = M_V(w_I). The viewing angle prediction network is an eight-layer MLP trained with the cross-entropy loss. For each category, we use the rendered images described in Section 3.3 and collect 36,000 paired training samples (w, v) from 3,000 shapes in 12 different viewing angles.
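Since the dataset samples yaw at 30° intervals, the predictor's 12-way one-hot target can be built with a small helper (our own, not from the paper):

```python
import numpy as np

# Map a yaw angle to the 12-way one-hot target used for the cross-entropy
# classification of viewing angles (30° bins on the yaw axis).
def angle_to_onehot(yaw_deg, step=30, bins=12):
    idx = int(round((yaw_deg % 360) / step)) % bins
    v = np.zeros(bins)
    v[idx] = 1.0
    return v

print(int(angle_to_onehot(-30).argmax()))  # 11: -30° wraps to the last bin
print(int(angle_to_onehot(90).argmax()))   # 3
```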
To manipulate the viewing angle of input image I, we obtain the shape attribute S of the shape in the input image using M_F. We then obtain the edited image by pairing S with the manipulated viewing angle vector v′:

I′ = G(M_B(S, v′)).

Implementation details
For each man-made object category, we trained the StyleGAN-ADA generator using a batch size of 64 and a learning rate of 0.0025. Both the forward and backward mapping networks were trained using the Adam [KB14] optimizer for 3,000 epochs with a batch size of 64. All networks were trained on a Tesla V100 GPU. We implemented our pipeline using PyTorch [PGM*19]. The forward mapping networks were trained for 0.5 to 3 days, depending on the number of part labels. The backward mapping networks were trained for 20 hours, and the forward and backward mapping networks were finetuned together for 10 hours.
Baseline disentangled network. We use a conditional variational autoencoder (cVAE) as the baseline disentangled method and quantitatively compare against it on image shape reconstruction and viewing angle manipulation. For each shape, we model three factors: part identity, part size, and viewing angle. All factors are discrete variables represented as one-hot vectors during training and testing.

Image shape reconstruction
The key contribution of the proposed framework is the pair of mapping functions between the GAN latent space and the shape attribute latent space. We evaluated these mapping functions qualitatively and quantitatively through image shape reconstruction.

With/without size finetuning for M_F
As described in Section 3.3, after the Laplacian feature reconstruction training, we performed additional size finetuning of M_F with the vertex coordinate loss for better reconstruction. To test the effectiveness of the size finetuning, we evaluated the reconstruction quality on the front view of the testing images described in Section 3.3. We sampled 4,000 points on each shape and calculated the bidirectional Chamfer distance as the shape reconstruction error metric, and we used the perceptual loss as the image reconstruction error metric. Both the image and shape reconstruction errors were lower when the network was trained with size finetuning (Table 1(a)). The image reconstruction perceptual loss distributions (Figure 7) also show that for the chair and cup categories, the error distributions were significantly lower when the network was trained with size finetuning.
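A sketch of the bidirectional Chamfer distance used as the shape metric; the squared-distance form and the subsampled point count are our choices (the paper samples 4,000 points per shape):

```python
import numpy as np

# Bidirectional Chamfer distance between two point sets: for each point,
# find its nearest neighbor in the other set, and average both directions.
def chamfer(a, b):
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    return d.min(1).mean() + d.min(0).mean()

rng = np.random.default_rng(0)
pts = rng.standard_normal((500, 3))      # subsampled for this sketch
print(chamfer(pts, pts))                 # 0.0 for identical point sets
print(chamfer(pts, pts + 0.1) > 0)       # True: offset clouds have positive distance
```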

Finetuning for M_F and M_B
After separately training M_F and M_B, we performed an end-to-end finetuning to improve the reconstruction results. We compared the full and non-finetuned versions of our method. The full version included the shape attribute regularizer L_Preg to prevent the shape from collapsing. We tested about 300 images in each category and obtained the mean perceptual losses of the finetuned and non-finetuned networks. Through this step, the image reconstruction errors for the chair, car, cup, and guitar categories were reduced by 28%, 37%, 27%, and 32%, respectively.

Table 1: Mean shape reconstruction errors (E_s) and mean image reconstruction errors (E_i) of results obtained with and without size finetuning, and of the baseline conditional VAE (cVAE) method. We do not report the mean shape reconstruction error for the baseline method because it does not reconstruct an explicit shape.

Part replacement results
We randomly picked 12 different chair images and exchanged their parts. We applied several replacement patterns to all possible combinations of the 12 picked images and produced edited images for each category. We show some replacement results in Figure 8; detailed results are available in the supplementary material.

Part resizing results
We trained separate GAN latent trajectory finetuners for each part using the resizing dataset discussed in Section 4.2. We compared the test results obtained by two methods: using the shape attribute trajectory (r_P) and using the GAN space trajectory finetuner (F_r). The comparison results and mean perceptual losses are shown in Figure 9 and Table 2, respectively. The manipulation results obtained from F_r showed lower perceptual loss and matched the desired resize manipulation better than the results obtained from r_P. Moreover, we applied multiple trajectory finetuners trained on different parts of the same image to resize multiple parts simultaneously. The failure case (last row in Figure 9) occurs because the target resized image lies outside the GAN latent space, so the finetuner could not find a good trajectory to fit the target image.

Viewing angle manipulation results
Figure 10 shows the viewing angle manipulation results. By manipulating the one-hot viewing angle vector of a given image, our method can produce images from different viewing angles. We tested the images described in Section 3.3 against the baseline method.
For each shape, each method synthesizes images from 12 different viewing angles as described in Section 5.3.3. In Table 3, we show the mean perceptual losses across different viewing angles. Our method obtains lower perceptual losses than both the baseline method (cVAE) and pixelNeRF [YYTK21]; we trained a 1-view pixelNeRF using our training data. This result suggests that the explicit shape attribute latent space helps to build better geometries than random latent vectors and the implicit geometries learned by the radiance field-based method [YYTK21]. The shape identities of the manipulated results (Figure 10) were well maintained. In Figure 11, we show visual comparisons of our results and those of pixelNeRF [YYTK21]. pixelNeRF often fails to reconstruct the geometric details of shape parts, so the shape identities are not maintained.

Real image results
We tested our pipeline for shape part manipulation of real images. To infer the shape attributes of an object in an input real image I, we first colorized the shape region in I so that it had the same appearance as our dataset. We projected the colorized image Ĩ into the GAN latent space to obtain its GAN latent code w_Ĩ. Because the shape in Ĩ is often outside the domain covered by our training dataset, our pipeline yielded unsatisfactory results (the "without shape-specific finetuning" row in Figure 12). To address this problem, we designed a shape-specific finetuning process. We identified the best forward and backward mapping network parameters θ′_MF and θ′_MB for Ĩ by optimizing the following function:

min_{θ′_MF, θ′_MB} L_LPIPS(G(M_B(M_F(w_Ĩ; θ′_MF), V; θ′_MB)), Ĩ) + λ_3 ||θ′_MF − θ_MF||_2^2 + λ_4 ||θ′_MB − θ_MB||_2^2,

where θ_MF and θ_MB are the parameters of the pretrained forward and backward mapping functions, and λ_3 and λ_4 are the weights of losses representing the distances between the pretrained and optimized mapping functions. Figure 12 shows the manipulated results for a real image obtained through the shape-specific finetuning process. After obtaining the manipulated image, we warped the texture of the source real image onto it using thin-plate spline (TPS) warping. We identified the object contours in the input image and the manipulated images using [S*85] and sampled points on the contours as control points for TPS warping.

Figure 10: Viewing angle manipulation results (yaw angles from 0° to ±180° at 30° intervals). Our method can synthesize images of the shape from different viewing angles.

Limitation and future work
In this paper, we propose a framework for bridging the image latent space and 3D shape attribute space using shape-consistent mapping functions. Furthermore, we show that the mapping functions enable image shape part manipulation, supporting part replacement, part resizing, and viewpoint manipulation. However, despite the usefulness of our framework, as demonstrated through manipulations of several man-made object categories, it still has several limitations.
Entanglement between parts. As shown in Figure 13, some non-manipulated parts were altered after part replacement manipulations. This suggests that some parts are still entangled in the image latent space. This could potentially be resolved by using more specific shape part supervision signals, such as part masks.

Fixed shape attribute space. We trained one end-to-end mapping function for each category and assumed that the number of parts in the input shape was always less than the predefined number. This limitation is inherited from the adopted shape attribute space (i.e., SDM-NET). In the future, we may explore a more flexible shape attribute space.

Gap between synthesized and real images. Our framework focuses on the geometry of the input synthesized image. To manipulate the shape of a real image, we need to infer the appearance of the deformed textures, including the unseen sides, to achieve realistic manipulation results. Moreover, we plan to support lighting manipulation by incorporating lighting parameters into our framework.

Conclusion
In this paper, we propose a framework for bridging the image latent space and 3D shape attribute space using shape-consistent mapping functions. Furthermore, we show that the mapping functions enable image shape part manipulation, supporting part replacement, part resizing, and viewpoint manipulation.

Figure 2: The network architecture of our cross-domain mapping framework.

• Geometric attribute (P ∈ R^{z×n_c}): all the PartVAE latent codes of a shape, where z = 128 is the dimension of the PartVAE latent code adopted in our work.
• Topology attribute (T ∈ R^{2n_c+9}): the representation vector rv in SDM-NET. This vector represents the geometry and associated relationships of each part in s; please refer to Section 3.1 of Gao et al. [GYW*19].

Figure 3: Visualization of Laplacian feature differences (a, b) and vertex coordinate differences (c, d) between shapes 1 and 2. In contrast to Laplacian feature differences, vertex coordinate differences capture the deformations.
Figure 4: (a) Input image I; (b) the mapped shape of w_I without L_Preg; (c) the mapped shape of w_I with L_Preg; (d) the target shape rendered in the input image.

Figure 5: Part replacement editing process based on our pipeline.

Figure 6: The architecture of the baseline cVAE method.

Figure 7: The perceptual loss distributions of the image shape reconstruction with and without size finetuning.

Figure 8: Results of part replacement. The images were produced from the attributes of shapes 1 and 2.

Figure 9: Part resizing results for different categories. (a) Input image; (b) resized image directly obtained using P + r_P; (c) resized image obtained using the GAN space trajectory finetuner; (d) the inverted result of the ground truth; (e) the rendered ground truth.

Figure 11: Comparison with pixelNeRF [YYTK21]. Our method synthesizes shapes with more details, while pixelNeRF only synthesizes blurry shapes from the same input view.

Figure 12: Editing results of real images. The first row shows the texture mapping results obtained from the shapes inferred through shape-specific finetuning. The second row shows shape images inferred with shape-specific finetuning, while the third row shows shape images inferred without it. (a) Input real image; (b) shape and image reconstructed using our pipeline; (c) seat replacement result; (d) resizing result with a wider back; (e) resizing result with a wider seat.

Figure 13: (a) The sources of the back and seat parts; (b) the source of the leg part; (c) replacement result; (d) ground truth. The replacement result indicates entanglement between the seat and leg parts.

Table 2: Quantitative results for the part resizing manipulation: comparison of mean perceptual losses of results obtained using the shape attribute trajectory (r_P) and the GAN space trajectory finetuner (F_r).