
1 Introduction

Realistic garment reconstruction is a notoriously complex problem, and its importance is undeniable in many research areas and applications, such as accurate body shape and pose estimation in the wild (i.e., from observations of clothed humans), realistic AR/VR experiences, movies, video games, virtual try-on, etc.

For the past decades, physics-based simulations have set the standard in the movie and video game industries, even though they require hours of labor by experts. More recently, methods for full clothing reconstruction using multi-view videos or 3D scan systems have also been proposed [38]. Global deformations can be reconstructed with high fidelity semi-automatically. Nevertheless, accurately recovering geometric details such as fine cloth wrinkles has remained a challenge.

Fig. 1. Accurate and realistic clothing modeling with DeepWrinkles, our entirely data-driven framework. (Left) 4D data capture. (Middle Left) Reconstruction from subspace model. (Middle Right) Fine wrinkles in normal map generated by our adversarial neural network. (Right) 3D rendering and animation on virtual human.

In this paper, we present DeepWrinkles (see Fig. 1), a novel framework to generate accurate and realistic clothing deformation from real data capture. It consists of two complementary modules: (1) A statistical model is learned from 3D scans of clothed people in motion, from which clothing templates are precisely non-rigidly aligned. Clothing shape deformations are therefore modeled using a linear subspace model, where human body shape and pose are factored out, hence enabling body retargeting. (2) Fine geometric details are added to normal maps generated using a conditional adversarial network whose architecture is designed to enforce realism and temporal consistency.

To our knowledge, this is the first method that tackles 3D surface geometry refinement using a deep neural network on normal maps. With DeepWrinkles, we obtain unprecedented high-quality rendering of clothing deformation, where the global shape as well as fine wrinkles from (real) high-resolution observations can be recovered, using an entirely data-driven approach. Figure 2 gives an overview of our framework with a T-shirt as example. The additional materials contain videos of the results. We show how the model can be applied to virtual human animation, with body shape and pose retargeting.

2 Related Work

Cloth modeling and garment simulation have a long history that dates back to the mid-1980s. A general overview of fundamental methods is given in [51]. There are two mostly opposing approaches to this problem: one uses physics-based simulations to generate realistic wrinkles, while the other captures and reconstructs details from real-world data.

Fig. 2. Outline of DeepWrinkles. During Learning, we learn to reconstruct global shape deformations using a Statistical Model and a mapping from pose parameters to Blend shape parameters using real-world data. We also train a neural network (cGAN) to generate fine details on normal maps from lower resolution ones. During runtime, the learned models reconstruct shape and geometric details given a priori body shape and pose. Inputs are in violet, learned models are in cyan. (Color figure online)

Physics-Based Simulation. For the past decades, models relying on Newtonian physics have been widely applied to simulate cloth behavior. They usually model various material properties such as stretch (tension), stiffness, and weight. For certain types of applications (e.g., involving the human body), additional models or external forces have to be taken into account, such as body kinematics, body surface friction, interpenetration, etc. [4, 8, 9, 16]. Note that several models have been integrated in commercial solutions (e.g., Unreal Engine APEX Cloth/Nvidia NvCloth, Unity Cloth, Maya nCloth, MarvelousDesigner, OptiTex, etc.) [32]. Nevertheless, obtaining realistic cloth deformation effects typically requires hours or days, if not weeks, of computation, retouching work, and parameter tuning by experts.

3D Capture and Reconstruction. Vision-based approaches have explored ways to capture cloth surface deformation under stress and to estimate material properties from visual observations for simulation purposes [29, 33, 34, 53]. Several methods also directly reconstruct whole object surfaces from real-world measurements. [54] uses texture patterns to track and reconstruct garments from a video, while 3D reconstruction can also be obtained from multi-view videos without markers [7, 29, 33, 41, 48, 49]. However, without sufficient priors, the reconstructed geometry can be quite crude. When the target is known (e.g., the clothing type), templates can improve the reconstruction quality [3]. More details can also be recovered by applying super-resolution techniques to input images [7, 17, 45, 47], or by using photometric stereo and information about lighting [22, 50]. Naturally, depth information can lead to further improvement [14, 36].

In recent work [38], cloth reconstruction is obtained by clothing segmentation and template registration from 4D scan data. Captured garments can be retargeted to different body shapes. However, the method has limitations regarding fine wrinkles.

Coarse-to-Fine Approaches. To reconstruct fine details, and consequently handle the bump in resolution at runtime (i.e., higher resolution meshes or more particles for simulation), methods based on dimension reduction (e.g., linear subspace models) [2, 21] or coarse-to-fine strategies are commonly applied [25, 35, 52]. DRAPE [19] automates the process of learning linear subspaces from simulated data and applying them to different subjects. The model factors out body shape and pose to produce a global cloth shape and then applies the wrinkles of seen garments. However, deformations are applied per triangle as in [42], which is not optimal for online applications. Additionally, for all these methods, simulated data is tedious to generate, and accuracy and realism are limited.

Learning Methods. The previously mentioned methods focus on efficient simulation and representation of previously seen data. Going a step further, several methods have attempted to generalize this knowledge to unseen cases. [46] learns bags of dynamical systems to represent and recognize repeating patterns in wrinkle deformations. In DeepGarment [13], the global shape and low-frequency details are reconstructed from a single segmented image using a CNN, but no retargeting is possible.

Only sparse work has been done on learning to add realistic details to 3D surfaces with neural networks, but several methods exist to enrich facial scans with texture [37, 40]. In particular, Generative Adversarial Networks (GANs) [18] are suitable for enhancing low-dimensional information with details. In [27], a GAN is used to create realistic images of clothed people given a (possibly random) pose.

Outside of clothing, SR-GAN [28] addresses the super-resolution problem of recovering photo-realistic textures from heavily downsampled images on public benchmarks. The task is similar to ours in that it generates high-frequency details from coarse inputs, but SR-GAN relies on a content loss motivated by perceptual similarity rather than similarity in pixel space. [10] uses a data-driven approach with a CNN to simulate highly detailed smoke flows. In contrast, pix2pix [24] proposes a conditional GAN that creates realistic images from sketches or annotated regions, or vice versa. This design suits our problem better, as we aim at learning and transferring the underlying image structure.

In order to represent the highest possible level of detail at runtime, we propose to revisit the traditional rendering pipeline of 3D engines with computer vision. Our contributions take advantage of the normal mapping technique [11, 12, 26]. Note that displacement maps have been used to create wrinkle maps from texture information [5, 15]. However, while the results are visually good on faces, they still require a high-resolution mesh, and no temporal consistency is guaranteed across time. (Also, faces are arguably less difficult to track than clothing, which is prone to occlusions and is looser.)

In this work, we present the first entirely data-driven method that uses a deep neural network on normal maps to leverage the 3D geometry of clothing.

3 Deformation Subspace Model

We model cloth deformations by learning a linear subspace model that factors out body pose and shape, as in [19]. However, our model is learned from real data, and deformations are applied per vertex for speed and flexibility regarding graphics pipelines [31]. Our strategy ensures deformations are represented compactly and with high realism. First, we compute robust template-based non-rigid registrations from a 4D scan sequence (Sect. 3.1), then a clothing deformation statistical model is derived (Sect. 3.2) and finally, a regression model is learned for pose retargeting (Sect. 3.3).

3.1 Data Preparation

Data Capture. For each type of clothing, we capture 4D scan sequences at 60 fps (e.g., 10.8k frames for 3 min) of a subject in motion, dressed in a full-body suit with one piece of clothing with colored boundaries on top. Each frame consists of a 3D surface mesh with around 200k vertices, yielding very detailed folds on the surface but partially corrupted by holes and noise (see Fig. 1a). This setup allows simple color-based 3D clothing extraction. In addition, capturing only one garment prevents occlusions where clothing normally overlaps (e.g., waistbands), and garments can be freely combined with each other.

Body Tracking. 3D body pose is estimated at each frame using a method in the spirit of [44]. We define a skeleton with j joints described by \(p_j\) parameters representing transformation and bone length. Joint parameters are also adjusted to body shape, which is estimated using [31, 55]. The posed human body is obtained using a linear blend skinning function \(S: \mathbb {R}^{3 \times v} \times \mathbb {R}^{p_j} \rightarrow \mathbb {R}^{3 \times v}\) that transforms (any subset of) the v vertices of a 3D deformable human template in a normalized pose (e.g., T-pose) to a pose defined by the j skeleton joints.
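
For illustration, the sketch below shows how such a skinning function can be evaluated for (a subset of) template vertices. The data layout, with per-vertex skinning weights and per-joint homogeneous transforms already composed from the pose parameters, is an assumption for this sketch and not necessarily the exact representation used in our pipeline.

```python
import numpy as np

def skinning(vertices, weights, joint_transforms):
    """Linear blend skinning S: map template vertices in normalized pose
    to a target pose defined by per-joint rigid transforms.

    vertices:         (v, 3) template vertex positions in T-pose
    weights:          (v, j) skinning weights, rows sum to 1
    joint_transforms: (j, 4, 4) homogeneous transforms per joint
                      (assumed already composed from the pose parameters)
    """
    v_hom = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)  # (v, 4)
    # Blend the joint transforms per vertex, then apply them.
    blended = np.einsum("vj,jab->vab", weights, joint_transforms)            # (v, 4, 4)
    posed = np.einsum("vab,vb->va", blended, v_hom)[:, :3]
    return posed

# Applying S to a clothing subset only requires slicing the rows:
# posed_cloth = skinning(vertices[cloth_idx], weights[cloth_idx], joint_transforms)
```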

Fig. 3. (Left) Before registration, a template is aligned to the clothing scans by skinning; boundaries are misplaced. (Right) Examples of registrations with different shirts on different people. Texture stability across the sequence shows the method is robust to drift.

Registration. We define a clothing template \(\bar{\mathcal {T}}\) by choosing a subset of the human template with consistent topology. \(\bar{\mathcal {T}}\) should contain enough vertices to model deformations (e.g., 5k vertices for a T-shirt), as shown in Fig. 3. The clothing template is then registered to the 4D scan sequence using a variant of non-rigid ICP based on grid deformation [20, 30]. The following objective function \(\mathcal {E}_{reg}\), which optimizes affine transformations of grid nodes, is iteratively minimized using the Gauss-Newton method:

$$\begin{aligned} \mathcal {E}_{reg} = \mathcal {E}_{data} + \omega _r\cdot \mathcal {E}_{rigid} + \omega _s\cdot \mathcal {E}_{smooth} + \omega _b\cdot \mathcal {E}_{bound}, \end{aligned}$$
(1)

where the data term \(\mathcal {E}_{data}\) aligns template vertices with their nearest neighbors on the target scans, \(\mathcal {E}_{rigid}\) encourages each triangle deformation to be as rigid as possible, and \(\mathcal {E}_{smooth}\) penalizes inconsistent deformation of neighboring triangles. In addition, we introduce the energy term \(\mathcal {E}_{bound}\) to ensure alignment of boundary vertices, which is unlikely to occur otherwise (see below for details). We set \(\omega _r=500\), \(\omega _s=500\), and \(\omega _b=10\) experimentally. One template registration takes around 15 s (CPU only).
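
For intuition, here is a minimal sketch of how the four terms of Eq. 1 could be evaluated. The grid-node parametrization, the neighbor structure, and the residual definitions are simplified assumptions, and the boundary pairs come from the matching described in the next paragraph; our actual solver optimizes affine transformations of grid nodes with Gauss-Newton rather than merely evaluating the energy.

```python
import numpy as np
from scipy.spatial import cKDTree

def registration_energy(deformed_v, scan_v, node_A, node_edges,
                        boundary_pairs, w_r=500.0, w_s=500.0, w_b=10.0):
    """Evaluate a simplified version of Eq. 1.

    deformed_v:     (v, 3) template vertices after applying the grid deformation
    scan_v:         (m, 3) target scan vertices
    node_A:         (g, 3, 3) affine part of each grid-node transformation
    node_edges:     list of (a, b) index pairs of neighboring grid nodes
    boundary_pairs: (template index, matched scan point) pairs from boundary matching
    """
    # Data term: squared distance to the nearest scan point.
    dists, _ = cKDTree(scan_v).query(deformed_v)
    e_data = np.sum(dists ** 2)

    # As-rigid-as-possible term: deviation of A from a rotation (A^T A ~ I).
    e_rigid = sum(np.sum((A.T @ A - np.eye(3)) ** 2) for A in node_A)

    # Smoothness term: neighboring grid nodes should deform similarly.
    e_smooth = sum(np.sum((node_A[a] - node_A[b]) ** 2) for a, b in node_edges)

    # Boundary term: matched boundary points should coincide.
    e_bound = sum(np.sum((deformed_v[t] - s) ** 2) for t, s in boundary_pairs)

    return e_data + w_r * e_rigid + w_s * e_smooth + w_b * e_bound
```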

Boundary Alignment. During data capture, the boundaries of the clothing are marked in a distinguishable color, and the corresponding scan points are assigned to the set \(\mathcal {B_S}\). We call the set of boundary points on the template \(\mathcal {B_T}\). Matching point pairs in \(\mathcal {B_S} \times \mathcal {B_T}\) should be distributed evenly over the scan and the template, and ideally capture all details in the folds. As this is not the case if each point in \(\mathcal {B_T}\) is simply paired with the closest scan boundary point (see Fig. 4), we instead select a match \(s_t \in \mathcal {B_S}\) for each point \(t \in \mathcal {B_T}\) as follows:

$$\begin{aligned} s_t = \arg \max _{s \in \mathcal {C}}\Vert t - s \Vert \quad \text {with } \quad \mathcal {C} = \left\{ s' \in \mathcal {B_S} \mid \arg \min _{t' \in \mathcal {B_T}} \Vert s' - t' \Vert = t \right\} . \end{aligned}$$
(2)

Notice that \(\mathcal {C}\) might be empty. This ensures consistency along the boundary and better captures high frequency details (which are potentially further away).
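
The matching rule of Eq. 2 can be implemented directly. The NumPy sketch below assumes the boundary points are given as arrays of 3D coordinates and returns, for each template boundary point, the farthest scan point among those that elect it as their nearest template point (or no match if that set is empty).

```python
import numpy as np

def match_boundary(B_T, B_S):
    """Boundary matching of Eq. 2.

    B_T: (t, 3) template boundary points
    B_S: (s, 3) scan boundary points (typically many more than B_T)
    Returns a dict {template index -> matched scan point}.
    """
    # For every scan boundary point, find its closest template boundary point.
    d = np.linalg.norm(B_S[:, None, :] - B_T[None, :, :], axis=2)  # (s, t)
    owner = np.argmin(d, axis=1)

    matches = {}
    for t in range(len(B_T)):
        C = np.where(owner == t)[0]   # scan points whose nearest template point is t
        if len(C) == 0:               # C may be empty: no constraint for t
            continue
        # Pick the farthest candidate to capture distant fold details.
        s_t = C[np.argmax(np.linalg.norm(B_S[C] - B_T[t], axis=1))]
        matches[t] = B_S[s_t]
    return matches
```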

3.2 Statistical Model

The statistical model is computed using linear subspace decomposition by PCA [31]. Poses \(\{\mathcal {\theta }_1,...,\mathcal {\theta }_n\}\) of all n registered meshes \(\{\mathcal {R}_1,...,\mathcal {R}_n\}\) are factored out from the model by pose-normalization using inverse skinning: \(S^{-1}(\mathcal {R}_i, \mathcal {\theta }_i) = \bar{\mathcal {R}}_i \in \mathbb {R}^{3 \times v}\). In what follows, meshes in normalized pose are marked with a bar. Each registration \(\bar{\mathcal {R}}_i\) can be represented by a mean shape \(\bar{\mathcal {M}}\) and vertex offsets \(o_i\), such that \(\bar{\mathcal {R}}_i = \bar{\mathcal {M}} + o_i\), where the mean shape \(\bar{\mathcal {M}} \in \mathbb {R}^{3 \times v}\) is obtained by averaging vertex positions: \(\bar{\mathcal {M}} = \sum _{i = 1}^n \frac{\bar{\mathcal {R}}_i}{n}\). The n principal directions of the matrix \(O = [ o_1\ \cdots \ o_n ]\) are obtained by singular value decomposition: \(O = U \Sigma V^\top \). Ordered by the largest singular values, the corresponding singular vectors contain information about the most dominant deformations.

Finally, each \(\mathcal {R}_i\) can be compactly represented by \(k \le n\) parameters \(\{\lambda _1^i,...,\lambda _k^i\} \in \mathbb {R}^k\) (instead of its \(3 \times v\) vertex coordinates), with the linear blend shape function B, given a pose \(\mathcal {\theta }_i\):

$$\begin{aligned} B(\{\lambda _1^i,...,\lambda _k^i\}, \mathcal {\theta }_i) = S\left( \bar{\mathcal {M}} + \sum _{l=1}^k \lambda _l^i \cdot V_l, \mathcal {\theta }_i\right) \approx \mathcal {R}_i \in \mathbb {R}^{3 \times v}, \end{aligned}$$
(3)

where \(V_l\) is the l-th singular vector. For a given registration, \(\lambda _l^i = V_l^\top (\bar{\mathcal {R}}_i - \bar{\mathcal {M}}) = V_l^\top o_i\) holds. In practice, choosing \(k = 500\) is sufficient to represent all registrations with a negligible error (less than 5 mm).
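
A minimal sketch of the subspace construction and the encode/decode steps is given below, assuming the pose-normalized registrations are stacked as flattened row vectors; applying the skinning function S with pose \(\mathcal {\theta }_i\) to the decoded shape then yields the posed mesh of Eq. 3.

```python
import numpy as np

def fit_subspace(R_bar, k=500):
    """Build the linear subspace of Sect. 3.2.

    R_bar: (n, 3*v) pose-normalized registrations, one flattened mesh per row.
    Returns the mean shape and the first k principal directions.
    """
    M_bar = R_bar.mean(axis=0)                    # mean shape
    O = R_bar - M_bar                             # vertex offsets o_i
    # Principal directions of the offsets, ordered by singular value.
    _, _, Vt = np.linalg.svd(O, full_matrices=False)
    return M_bar, Vt[:k]                          # (3*v,), (k, 3*v)

def encode(R_bar_i, M_bar, V):
    """Project a pose-normalized registration onto the subspace: lambda_1..lambda_k."""
    return V @ (R_bar_i - M_bar)

def decode(lmbda, M_bar, V):
    """Reconstruct the normalized-pose shape; pose it afterwards with S(., theta)."""
    return M_bar + V.T @ lmbda
```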

Fig. 4. Strategies for boundary alignment. Template boundary points \(\mathcal {B_T}\) are denoted in black; scan boundary points \(\mathcal {B_S}\) (significantly more numerous than the template points) are in red and blue, with those paired to a template point shown in blue. (Left) Pairing each template point to its closest scan neighbor ignores distant details. (Right) Each template point in \(\mathcal {B_T}\) is paired with the furthest point (marked in blue) among the scan points in \(\mathcal {B_S}\) that are closest to it. (Color figure online)

3.3 Pose-to-Shape Prediction

We now learn a predictive model f that takes as input the j joint poses and outputs a set of k shape parameters \(\varLambda \). This allows powerful applications where deformations are induced by pose. To take into account the deformation dynamics that occur during human motion, the model is also trained with pose velocity, acceleration, and shape parameter history. These inputs are concatenated in the control vector \(\varTheta \), and f can be obtained using autoregressive models [2, 31, 39].

In our experiments with clothing, we solved for f in a straightforward way by linear regression: \(F = \varLambda \cdot \varTheta ^\dagger \), where F is the matrix representation of f, and \(\dagger \) denotes the Moore-Penrose inverse. While this allows (limited) pose retargeting, we observed a loss of reconstruction detail. One reason is that, under motion, the same pose can give rise to various configurations of folds depending on the direction of movement, the speed, and previous fold configurations.
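
The linear model amounts to a single pseudo-inverse, as in the sketch below; the column-per-frame layout of the matrices is an assumption made for illustration.

```python
import numpy as np

def fit_pose_to_shape(Lambda, Theta):
    """Linear pose-to-shape model F = Lambda · Theta† (Moore-Penrose inverse).

    Lambda: (k, n) blend shape parameters, one column per frame
    Theta:  (d, n) control vectors (pose, velocity, ...), one column per frame
    """
    return Lambda @ np.linalg.pinv(Theta)

def predict_shape(F, theta):
    """Predict the k shape parameters from a single control vector theta of shape (d,)."""
    return F @ theta
```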

To obtain a non-linear mapping, we consider the components of \(\varTheta \) and \(\varLambda \) as multivariate time series and train a deep multi-layer recurrent neural network (RNN) [43]. A sequence-to-sequence encoder-decoder architecture with Long Short-Term Memory (LSTM) units is well suited, as it allows continuous predictions while being easier to train than vanilla RNNs and outperforming shallow LSTMs. We compose \(\varTheta \) of the j joint pose parameters, plus the velocity and acceleration of the joint root. MSEs compared to linear regression are reported in Sect. 5.3.
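
The PyTorch sketch below illustrates such an encoder-decoder; the layer count, hidden size, and sequence length follow Sect. 5.3, but the decoder input (teacher forcing with previous shape parameters) and all other details are assumptions of this sketch, not a description of our exact model.

```python
import torch
import torch.nn as nn

class Pose2ShapeSeq2Seq(nn.Module):
    """Sequence-to-sequence encoder-decoder with LSTM units, mapping a short
    history of control vectors Theta to a sequence of shape parameters Lambda."""

    def __init__(self, theta_dim, lambda_dim, hidden=256, layers=4):
        super().__init__()
        self.encoder = nn.LSTM(theta_dim, hidden, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(lambda_dim, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, lambda_dim)

    def forward(self, theta_seq, lambda_prev_seq):
        # theta_seq:       (batch, 3, theta_dim)  control-vector history
        # lambda_prev_seq: (batch, 3, lambda_dim) previous shape parameters (teacher forcing)
        _, state = self.encoder(theta_seq)          # summarize the control history
        dec_out, _ = self.decoder(lambda_prev_seq, state)
        return self.out(dec_out)                    # (batch, 3, lambda_dim)
```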

Fig. 5. Limits of registration and subspace model. (Left) Global shape is well recovered, but many visible (high frequency) details are missing. (Right) Increasing the resolution of the template mesh is still not sufficient. Note that [38] suffers from the same limitations.

4 Fine Wrinkle Generation

Our goal is to recover all observable geometric details. As previously mentioned, template-based methods [38] and subspace-based methods [19, 21] cannot recover every detail such as fine cloth wrinkles due to resolution and data scaling limitations, as illustrated in Fig. 5.

Assuming the finest details are captured at sensor image pixel resolution, and are reconstructed in 3D (e.g., using a 4D scanner as in [6, 38]), all existing geometric details can then be encoded in a normal map of the 3D scan surface at lower resolution (see Fig. 6). To automatically add fine details on the fly to reconstructed clothing, we propose to leverage normal maps using a generative adversarial network [18]. See Fig. 8 for the architecture. In particular, our network induces temporal consistency on the normal maps to increase realism in animation applications.

Fig. 6. All visible details from an accurate 3D scan are generated in our normal map for incredible realism. Here, a virtual shirt is seamlessly added on top of an animated virtual human (e.g., scanned subject).

Fig. 7. Examples of our dataset. (Left) Low resolution input normal map. (Middle) High resolution target normal map from scan. Details, and noise, visible on the scan are reproduced in the image. Gray areas indicate no normal information was available on the scan. (Right) T-Shirt on a human model rendered without and with normal map.

4.1 Data Preparation

We take as input a 4D scan sequence and a sequence of corresponding reconstructed garments. The latter can be obtained either by registration, or by reconstruction using the blend shape or the regression, as detailed in Sect. 3. The clothing template meshes \(\bar{\mathcal {T}}\) are equipped with UV maps, which are used to project any pixel of an image to a point on the mesh surface, hence assigning a property encoded in a pixel to each surface point. Normal coordinates can therefore be normalized and stored as pixel colors in normal maps. Our training dataset then consists of pairs of normal maps (see Fig. 7): low-resolution (LR) normal maps obtained by blend shape reconstruction, and high-resolution (HR) normal maps obtained from the scans. For LR normal maps, the normal at a surface point (lying in a face) is linearly interpolated from the vertex normals. For HR normal maps, per-pixel normals are obtained by projecting the high-resolution observations (i.e., the 4D scan) onto the triangles of the corresponding low-resolution reconstruction, and then transferring the normal information using the UV map of \(\bar{\mathcal {T}}\). Note that normal maps cannot be computed directly from the scans, because the exact area of the garment is not defined on them, nor are they equipped with a UV map. Also, our normals are represented in global coordinates, as opposed to the tangent space coordinates that are standard for normal maps. The reason is that LR normal maps contain no information additional to the geometry and are therefore constant in tangent space, which makes them unsuitable for conditioning our adversarial neural network.
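
As a rough sketch of this baking step, the code below assumes the UV rasterization (per-pixel face index and barycentric coordinates) is precomputed, and replaces the projection of the scan onto the reconstruction by a simple nearest-neighbor normal transfer; it is an illustration of the data layout, not our exact procedure.

```python
import numpy as np
from scipy.spatial import cKDTree

def bake_normal_maps(verts, faces, vnormals, pix_face, pix_bary,
                     scan_points, scan_normals):
    """Bake LR and HR normal maps in global coordinates (Sect. 4.1).

    verts, faces, vnormals: low-resolution reconstruction (v,3), (f,3), (v,3)
    pix_face:  (H, W) face index per UV pixel, -1 outside the chart
    pix_bary:  (H, W, 3) barycentric coordinates per UV pixel
    scan_points, scan_normals: high-resolution 4D-scan samples (m,3), (m,3)
    """
    H, W = pix_face.shape
    lr = np.zeros((H, W, 3))
    hr = np.zeros((H, W, 3))
    valid = pix_face >= 0

    tri = faces[pix_face[valid]]           # (p, 3) vertex ids per covered pixel
    bary = pix_bary[valid]                 # (p, 3)

    # LR map: barycentric interpolation of the reconstruction's vertex normals.
    n = np.einsum("pc,pcd->pd", bary, vnormals[tri])
    lr[valid] = n / np.linalg.norm(n, axis=1, keepdims=True)

    # HR map: transfer the nearest scan normal to each pixel's surface point
    # (a simplification of projecting the scan onto the reconstruction).
    pts = np.einsum("pc,pcd->pd", bary, verts[tri])
    _, idx = cKDTree(scan_points).query(pts)
    hr[valid] = scan_normals[idx]

    # Encode [-1, 1] normal coordinates as colors in [0, 1].
    return 0.5 * (lr + 1.0), 0.5 * (hr + 1.0)
```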

4.2 Network Architecture

Due to the nature of our problem, it is natural to explore network architectures designed to enhance images (i.e., super-resolution applications). From our experiments, we observed that models trained on natural images, including those containing a perceptual loss term, fail (e.g., SR-GAN [28]). On the other hand, cloth deformations exhibit smooth patterns (wrinkles, creases, folds) that deform continuously in time. In addition, at a finer level, materials and fabric texture also contain high-frequency details.

Our proposed network is a conditional Generative Adversarial Network (cGAN) inspired by image transfer [24]. We use a convolution-batchnorm-ReLU structure [23] and a U-Net in the generative network, since we want latent information to be transferred across the network layers and the overall structure of the image to be preserved; this happens thanks to the skip connections. The discriminator only penalizes structure at the scale of patches and works as a texture loss. Our network is conditioned on low-resolution normal map images (size: 256 \(\times \) 256), which are enhanced with fine details learned from our real-data normal maps. See Fig. 8 for the complete architecture.
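
To make the structure concrete, here is a compact PyTorch sketch of a pix2pix-style conditional GAN in the spirit of Fig. 8. It uses fewer encoder/decoder levels than our 2 × 8 U-Net and a smaller PatchGAN than the 70 × 70 variant described in Sect. 5, so it illustrates the skip-connected generator and patch discriminator rather than reproducing our exact network.

```python
import torch
import torch.nn as nn

def down(cin, cout, norm=True):
    layers = [nn.Conv2d(cin, cout, 4, stride=2, padding=1)]
    if norm:
        layers.append(nn.BatchNorm2d(cout))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return nn.Sequential(*layers)

def up(cin, cout):
    return nn.Sequential(
        nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class UNetGenerator(nn.Module):
    """Shallow U-Net (4 down / 4 up) mapping a 3-channel LR normal map to an HR one."""
    def __init__(self, ch=64):
        super().__init__()
        self.d1 = down(3, ch, norm=False)       # 256 -> 128
        self.d2 = down(ch, ch * 2)              # 128 -> 64
        self.d3 = down(ch * 2, ch * 4)          # 64  -> 32
        self.d4 = down(ch * 4, ch * 8)          # 32  -> 16
        self.u1 = up(ch * 8, ch * 4)            # 16  -> 32
        self.u2 = up(ch * 8, ch * 2)            # 32  -> 64  (skip-concatenated input)
        self.u3 = up(ch * 4, ch)                # 64  -> 128
        self.out = nn.Sequential(
            nn.ConvTranspose2d(ch * 2, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, x):
        e1 = self.d1(x)
        e2 = self.d2(e1)
        e3 = self.d3(e2)
        e4 = self.d4(e3)
        y = self.u1(e4)
        y = self.u2(torch.cat([y, e3], dim=1))   # skip connections preserve structure
        y = self.u3(torch.cat([y, e2], dim=1))
        return self.out(torch.cat([y, e1], dim=1))

class PatchDiscriminator(nn.Module):
    """PatchGAN: scores overlapping patches of (condition, image) pairs as real/fake."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            down(6, ch, norm=False), down(ch, ch * 2), down(ch * 2, ch * 4),
            nn.Conv2d(ch * 4, 1, 4, padding=1))

    def forward(self, lr, hr):
        return self.net(torch.cat([lr, hr], dim=1))  # (batch, 1, h', w') patch logits
```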

Fig. 8. cGAN for realistic HR normal map generation from LR normal maps as input. Layer sizes are squared. Skip connections (red) in U-Net preserve underlying image structure across network layers. PatchGAN enforces wrinkle pattern consistency. (Color figure online)

Temporal consistency is achieved by extending the L1 network loss term. For compelling animations, it is not only important that each frame looks realistic, but also that no sudden jumps occur in the rendering. To ensure smooth transitions between consecutively generated images across time, we introduce an additional loss \(\mathcal {L}_{loss}\) to the GAN objective that penalizes discrepancies between generated images at time t and expected images (from the training dataset) at time \(t-1\):

$$\begin{aligned} \mathcal {L}_{loss} =\ \underbrace{\Vert \mathcal {I}_{gen}^t - \mathcal {I}_{gt}^t \Vert _1}_{\mathcal {L}_{data}} + \underbrace{\mid \sum _{i,j} (\mathcal {I}_{gen}^t - \mathcal {I}_{gt}^{t-1})_{i,j} \mid }_{\mathcal {L}_{temp}}, \end{aligned}$$
(4)

where \(\mathcal {L}_{data}\) helps to generate images near the ground truth in an \(L_1\) sense (for less blurring). The temporal consistency term \(\mathcal {L}_{temp}\) is meant to capture global fold movements over the surface: if a fold appears somewhere, it should most of the time have disappeared nearby, and vice versa. Our term does not, however, take spatial proximity into account. We also tried temporal consistency terms based on the \(L_1\)- and \(L_2\)-norms, and report the results in Table 1. See Fig. 9 for a comparison of results with and without the temporal consistency term.
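
As a sketch, the generator objective with the terms of Eq. 4 could be assembled as below, for a single frame and with the weights reported in Sect. 5; the specific non-saturating BCE formulation of the adversarial term is an assumption of this sketch.

```python
import torch

def generator_loss(disc_fake, I_gen_t, I_gt_t, I_gt_prev,
                   w_gan=1.0, w_data=100.0, w_temp=50.0):
    """Generator objective: cGAN term + Eq. 4, for single frames (C, H, W).

    disc_fake: discriminator patch logits on the generated image
    I_gen_t:   generated normal map at time t
    I_gt_t:    ground-truth normal map at time t
    I_gt_prev: ground-truth normal map at time t-1
    """
    # Adversarial term (patch-wise BCE against the "real" label).
    l_gan = torch.nn.functional.binary_cross_entropy_with_logits(
        disc_fake, torch.ones_like(disc_fake))

    # L_data: L1 distance to ground truth (less blurring than L2).
    l_data = (I_gen_t - I_gt_t).abs().sum()

    # L_temp: absolute value of the summed difference to the previous frame,
    # penalizing global appearance jumps rather than per-pixel changes.
    l_temp = (I_gen_t - I_gt_prev).sum().abs()

    return w_gan * l_gan + w_data * l_data + w_temp * l_temp
```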

Fig. 9. Examples trained on only 2000 training samples emphasize the effect of the additional loss \(\mathcal {L}_{temp}\). The pairs show the same two consecutive frames twice: (left) without the temporal consistency term, geometric noise appears or disappears instantly; (right) with the temporal consistency term, geometric continuity is preserved.

5 Experiments

This section evaluates the results of our reconstruction. 4D scan sequences were captured using a temporal-3dMD system (4D) [1]. Sequences are captured at 60 fps, and each frame consists of a colored mesh with 200k vertices. Here, we show results on two different shirts (for a female and a male subject). We trained the cGAN network on a dataset of 9213 consecutive frames: the first 8000 images compose the training set, the next 1000 images the test set, and the remaining 213 images the validation set. The test and validation sets contain poses and movements not seen in the training set. The U-Net auto-encoder is constructed with 2 \(\times \) 8 layers, with 64 filters in each of the first convolutional layers. The discriminator uses patches of size 70 \(\times \) 70. The \(\mathcal {L}_{data}\) weight is set to 100 and the \(\mathcal {L}_{temp}\) weight to 50, while the GAN weight is 1. The images have a resolution of 256 \(\times \) 256, although our early experiments also showed promising results at 512 \(\times \) 512.

5.1 Comparison of Approaches

We compare our results to different approaches (see Fig. 10). A physics-based simulation done by a 3D artist using MarvelousDesigner [32] returns a mesh imitating material properties similar to our scan and with a comparable amount of folds, but containing 53,518 vertices (i.e., an order of magnitude more). A linear subspace reconstruction with 50 coefficients derived from the registrations (Sect. 3) produces a mostly flat surface, while the registration itself shows smooth approximations of the major folds in the scan. Our method, DeepWrinkles, adds all high-frequency details seen in the scan to the reconstructed surface. These three methods use a mesh with 5,048 vertices. DeepWrinkles is shown with a 256 \(\times \) 256 normal map image.

Fig. 10. Comparison of approaches. (a) Physics-based simulation [32], (b) Subspace (50 coefficients) [19], (c) Registration [38], (d) DeepWrinkles (ours), (e) 3D scan (ground truth).

5.2 Importance of Reconstruction Details in Input

Our initial experiments showed promising results when reconstructing details from the original registration normal maps. To show the efficacy of the method, it is necessary to reconstruct details not only from registrations, but also from blend shapes and after regression. We therefore replaced the input images in the training set by normal maps constructed from the blend shapes with 500, 200, and 100 basis functions, and by one set from the regression reconstruction. The goal is to determine the amount of detail that is necessary in the input to obtain realistic detailed wrinkles. Table 1 shows the error rates of each experiment. 500 basis functions seem sufficient for a reasonable amount of detail in the result. Probably because the reconstruction from regression is noisier and bumpier, the neural network is not capable of reconstructing long, well-defined folds and instead produces many higher-frequency wrinkles (see Fig. 11). This indicates that the structures of the inputs are only refined by the network, and that important folds have to be visible in the input.

Table 1. Comparison of pixel-wise error values of the neural network for different training types. Data and Temporal are as defined in Eq. 4. (Left) Different temporal consistency terms: L1 and L2 take the respective distance between the output and the target at time \(t-1\). (Right) Different reconstruction methods used to produce the input normal map: Registr. refers to registration, BS to the blend shape with a given number of basis functions, and Regre. to regression.

Fig. 11. Examples of different training results for high resolution normal maps. Left to right: Global shape, target normal map, learned from registration normal map with temporal consistency, learned from blend shape with 200 basis functions and temporal consistency, as previous but with 500 basis functions, learned from registration normal map without temporal consistency. The example pose is not seen in the training set.

5.3 Retargeting

The final goal is to scan a piece of clothing in one or several sequences and then transfer it to new persons with new movements on the fly.

Poses. We experimented with various combinations of control vectors \(\varTheta \), including pose, shape, and joint root velocity and acceleration history. It turns out that most formulations in the literature are difficult to train or unstable [2, 31, 39]. We restrict the joint parameters to those directly related to each piece of clothing to reduce the dimensionality; in the case of shirts, this leaves the parameters related to the upper body. In general, linear regression generalized best but smoothed out many of the overall geometric details, even on the training set. We evaluated on 9213 frames for 500 and 1000 blend shape basis functions: \(MSE_{500} = 2.902\) and \(MSE_{1000} = 3.114\).

On the other hand, we trained an encoder-decoder with LSTM units (4 layers with dimension 256), using input and output sequences of length 3 (see Sect. 3.3). We obtained promising results: \(MSE_{rnn} = 1.892\). The supplemental materials show visually convincing reconstructed sequences.

Shapes. In Sect. 3.2, we represented clothing with folds as offsets from a mean shape. The same can be done with a human template for persons with different body shapes: each person \(\bar{\mathcal {P}_i}\) in normalized pose can be represented as an average template plus a vertex-wise offset, \(\bar{\mathcal {P}_i} = \bar{\mathcal {T'}} + o'_i\). Since the clothing mean shape \(\bar{\mathcal {M}} = \bar{\mathcal {T'}}_{\mid \mathcal {M}} + o'_{\mid \mathcal {M}}\) contains a subset of the vertices of the human template, it can be adjusted to any deformation of the template by taking \(\bar{\mathcal {M}}_{o'}= \bar{\mathcal {M}} + {o'_i}_{\mid \mathcal {M}}\), where \(\mid \mathcal {M}\) restricts the vertices of the human template to those used for the clothing. The mean in the blend shape can then simply be replaced by \(\bar{\mathcal {M}}_{o'}\), and Eq. 3 becomes:

$$\begin{aligned} B(\{\lambda _1^i,...,\lambda _k^i\}, \mathcal {\theta }_i) = S\left( \bar{\mathcal {M}} _{o'} + \sum _{l=1}^k \lambda _l^i \cdot V_l, \mathcal {\theta }_i\right) \approx {\mathcal {P}_i}_{\mid \mathcal {M}}. \end{aligned}$$
(5)

Replacing the mean shape affects surface normals. Hence, it is necessary to use normal maps in tangent space at rendering time. This makes them applicable to any body shape (see Fig. 12).
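
The mean-shape adjustment of Eq. 5 is a simple indexed addition, as sketched below; the flattened per-vertex arrays and the precomputed index set implementing \(\mid \mathcal {M}\) are assumptions of this sketch.

```python
import numpy as np

def retarget_mean_shape(M_bar, person_offsets, cloth_idx):
    """Adjust the clothing mean shape to a new body shape (Eq. 5).

    M_bar:          (c, 3) clothing mean shape in normalized pose
    person_offsets: (v, 3) vertex offsets o' of the person w.r.t. the human template
    cloth_idx:      (c,) indices of human-template vertices used for the clothing
    """
    return M_bar + person_offsets[cloth_idx]   # \bar{M}_{o'}

# The blend shape of Eq. 3 is then evaluated with this adjusted mean,
# and normal maps are applied in tangent space at render time.
```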

Fig. 12. Body shape retargeting. The first and fourth entries show shirts on the original models; the following two are retargeted to new body shapes.

6 Conclusion

We present DeepWrinkles, an entirely data-driven framework to capture and reconstruct clothing in motion from 4D scan sequences. Our evaluations show that high-frequency details can be added to low-resolution normal maps using a conditional adversarial neural network. We introduce an additional temporal loss to the GAN objective that preserves geometric consistency across time, and show qualitative and quantitative evaluations on different datasets. We also give details on how to create low-resolution normal maps from registered data, as registration fidelity turns out to be crucial for cGAN training. The two presented modules are complementary in achieving accurate and realistic rendering of the global shape and fine details of clothing. To the best of our knowledge, our method exceeds the level of detail of the current state of the art in both physics-based simulation and data-driven approaches by far. Additionally, the space requirement of a normal map is negligible in comparison to increasing the resolution of the clothing mesh, which makes our pipeline suitable for standard 3D engines.

Limitations. High-resolution normal maps can have missing information in areas not seen by the cameras, such as armpits. Hence, visually disruptive artifacts can occur, although the clothing template can fix most of these issues (e.g., with a pass of smoothing). At the moment, pose retargeting works best when new poses are similar to ones included in the training dataset. Although the neural network is able to generalize to some unseen poses, reconstructing the global shape from a new joint parameter sequence can be challenging. This should be addressed by scaling up the dataset.

Future Work. The scanning setup can be extended to reconstruct all body parts with sufficient detail and without occlusions, and our method can be applied to more diverse types of clothing and accessories, such as coats and scarves. Normal maps could also be used to add fine details like buttons, which are hard to capture in 3D.