Synthetic Humans for Action Recognition from Unseen Viewpoints

Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored. Our goal in this work is to answer the question of whether synthetic humans can improve the performance of human action recognition, with a particular focus on generalization to unseen viewpoints. We make use of the recent advances in monocular 3D human body reconstruction from real action sequences to automatically render synthetic training videos for the action labels. We make the following contributions: (i) we investigate the extent of variations and augmentations that are beneficial to improving performance at new viewpoints. We consider changes in body shape and clothing for individuals, as well as more action-relevant augmentations such as non-uniform frame sampling, and interpolating between the motion of individuals performing the same action; (ii) we introduce a new data generation methodology, SURREACT, that allows training of spatio-temporal CNNs for action classification; (iii) we substantially improve the state-of-the-art action recognition performance on the NTU RGB+D and UESTC standard human action multi-view benchmarks; finally, (iv) we extend the augmentation approach to in-the-wild videos from a subset of the Kinetics dataset to investigate the case when only one-shot training data is available, and demonstrate improvements in this case as well.


Introduction
Learning human action representations from RGB video data has been widely studied. Recent advances in convolutional neural networks (CNNs) [38] have shown excellent performance [6,15,16,18,41,82,86] on benchmark datasets, such as UCF101 [75]. However, the success of CNNs relies heavily on the availability of large-scale training data, which is not always the case. To address the lack of training data, several works explore the use of complementary synthetic data for a range of tasks in computer vision such as optical flow estimation, segmentation, human body and hand pose estimation [12,72,76,83,97]. In this work, we raise the question of how to synthesize videos for action recognition in the case of limited real data, such as only one viewpoint, or only one shot, available at training. Imagine a surveillance or ambient assisted living system, where a dataset is already collected for a set of actions from a certain camera. Placing a new camera in the environment at a new viewpoint would require re-annotating data, because the appearance of an action is drastically different when observed from different viewpoints [26,44,95]. In fact, we observe that state-of-the-art action recognition networks fail drastically when trained and tested on distinct viewpoints. Specifically, we train the model of [18] on videos from the benchmark dataset NTU RGB+D [69] where people are facing the camera. When we test this network on other front-view (0°) videos, we obtain ∼80% accuracy. When we test with side-view (90°) videos, the performance drops to ∼40% (see Section 4). This motivates us to study action recognition from novel viewpoints.
Existing methods addressing cross-view action recognition are typically evaluated in setups that are not challenging (e.g., the same subjects and similar viewpoints appear in both training and test splits [69]). We introduce and study a more challenging protocol with only one viewpoint at training, under which recent methods that assume multi-view training data [39,40,84] also become inapplicable.
A naive way to achieve generalization is to collect data from all views, for all possible conditions, but this is impractical due to combinatorial explosion [91]. Instead, we augment the existing real data synthetically to increase the diversity in terms of viewpoints, appearance, and motions. Synthetic humans are relatively easy to render for tasks such as pose estimation, because arbitrary motion capture (MoCap) resources can be used [72,83]. However, action classification requires certain motion patterns and semantics, and it is challenging to generate synthetic data with action labels [10]. Typical MoCap datasets [1], targeted at pose diversity, are not suitable for action recognition due to the lack of clean action annotations. Even if one collects a MoCap dataset, it is still limited to a pre-defined set of categories.
In this work, we propose a new, efficient and scalable approach for generating synthetic videos with action labels from the target set of categories. We employ a 3D human motion estimation method, such as HMMR [28] or VIBE [31], that automatically extracts the 3D human dynamics from a single-view RGB video. The resulting sequence of SMPL body [50] pose parameters is then combined with other randomized generation components (e.g., viewpoint, clothing) to render diverse complementary training data with action annotations. Figure 1 presents an overview of our pipeline. We demonstrate the advantages of such data when training spatio-temporal CNN models for (i) action recognition from unseen viewpoints and (ii) training with one-shot real data. We boost performance on unseen viewpoints from 53.6% to 69.0% on the NTU dataset, and from 49.4% to 66.4% on the UESTC dataset, by augmenting limited real training data with our proposed SURREACT dataset. Furthermore, we present an in-depth analysis of the importance of action-relevant augmentations such as diversity of motions and viewpoints, as well as our non-uniform frame sampling strategy, which substantially improves the action recognition performance. Our code and data will be available at the project page.

Related Work
Human action recognition is a well-established research field. For a broad review of the literature on action recognition, see the recent survey of Kong et al. [35]. Here, we focus on relevant works on synthetic data, cross-view action recognition, and briefly on 3D human shape estimation.
Among previous works that focus on synthetic human data, very few tackle action recognition [10,43,66]. Synthetic 2D human pose sequences [53] and synthetic point trajectories [65,67,25] have been used for view-invariant action recognition. However, RGB-based synthetic training for action recognition is relatively new, with [10] being one of the first attempts. De Souza et al. [10] manually define 35 action classes and jointly estimate real categories and synthetic categories in a multi-task setting. However, their categories are not easily scalable and do not necessarily relate to the target set of classes. Unlike [10], we automatically extract motion sequences from real data, making the method flexible for new categories. Recently, [62] has generated the VirtualHome dataset, a simulation environment with programmatically defined synthetic activities collected via crowdsourcing. Different from our work, the focus of [62] is not generalization to real data.
Most relevant to ours, [43] generates synthetic training images to achieve better performance on unseen viewpoints. The work of Liu et al. [43] is an extension of [66] by using RGB-D as input instead of depth only. Both works formulate a frame-based pose classification problem on their synthetic data, which they then use as features for action recognition. These features are not necessarily discriminative for the target action categories. Different than this direction, we explicitly assign an action label to synthetic videos and define the supervision directly on action classification.
Cross-view action recognition. Due to the difficulty of building multi-view action recognition datasets, the standard benchmarks have been recorded in controlled environments. RGB-D datasets such as IXMAS [87], UWA3D II [64] and N-UCLA [85] were state of the art until the availability of the large-scale NTU RGB+D dataset [69]. The size of NTU allows training deep neural networks unlike previous datasets. Very recently, Ji et al. [24] collected the first large-scale dataset, UESTC, that has a 360° coverage around the performer, although still in a lab setting.
Since multi-view action datasets are typically captured with depth sensing devices, such as Kinect, they also provide an accurate estimate of the 3D skeleton. Skeleton-based cross-view action recognition therefore received a lot of attention in the past decade [30,45,46,47,93]. Variants of LSTMs [21] have been widely used [45,46,69]. Recently, spatio-temporal skeletons were represented as images [30] or higher dimensional objects [47] where standard CNN architectures were applied.
RGB-based cross-view action recognition is in comparison less studied. Transforming RGB features to be view-invariant is not as trivial as transforming 3D skeletons. Early work on transferring appearance features from the source view to the target view explored the use of maximum margin clustering to build a joint codebook for temporally synchronous videos [14]. Following this approach, several other works focused on building global codebooks to extract view-invariant representations [34,49,67,94,95]. Recently, end-to-end approaches used human pose information as guidance for action recognition [3,48,52,98]. Li et al. [39] formulated an adversarial view-classifier to achieve view-invariance. Wang et al. [84] proposed to fuse view-specific features from a multi-branch CNN. Such approaches cannot handle single-view training [39,84]. Our method differs from these works by compensating for the lack of view diversity with synthetic videos. We augment the real data automatically at training time, and our model does not involve any extra cost at test time unlike [84]. Moreover, we do not assume real multi-view videos at training.
3D human shape estimation. Recovering the full human body mesh from a single image has been explored as a model-fitting problem [5,37], as regressing model parameters with CNNs [27,58,59,80], and as regressing non-parametric representations such as graphs or volumes [33,81]. Recently, CNN-based parameter regression approaches have been extended to video [28,42,31]. HMMR [28] builds on the single-image-based HMR [27] to learn the human dynamics by using 1D temporal convolutions. More recently, VIBE [31] adopts a recurrent model based on frame-level pose estimates provided by SPIN [32]. VIBE also incorporates an adversarial loss that penalizes the estimated pose sequence if it is not a 'realistic' motion, i.e., indistinguishable from the real AMASS [54] MoCap sequences. In this work, we recover 3D body parameters from real videos using HMMR [28] and VIBE [31]. Both methods employ the SMPL body model [50]. We provide a comparison between the two methods for our purpose of action recognition, which can serve as a proxy task to evaluate motion estimation.

Synthetic humans with action labels
Our goal is to improve the performance of action recognition using synthetic data in cases where the real data is limited, e.g. domain mismatch between training/test such as viewpoints or low-data regime. In the following, we describe the three stages of: (i) obtaining 3D temporal models for human actions from real training sequences (at a particular viewpoint) (Section 3.1); (ii) using these 3D temporal models to generate training sequences for new (and the original) viewpoints using a rendering pipeline with augmentation (Section 3.2); and (iii) training a spatio-temporal CNN with both real and synthetic data (Section 3.3).

3D human motion estimation
In order to generate a synthetic video with graphics techniques, we need to have a sequence of articulated 3D human body models. We employ the parametric body model SMPL [50], which is a statistical model, learned over thousands of 3D scans. SMPL generates the mesh of a person given the disentangled pose and shape parameters. The pose parameters (∈ R^72) control the kinematic deformations due to skeletal posture, while the shape parameters (∈ R^10) control identity-specific deformations such as the person height.
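For illustration, a minimal sketch of querying SMPL for a posed mesh is given below, assuming the third-party smplx Python package and a local model path; this is an illustrative example rather than the exact tooling of our rendering pipeline.

import torch
import smplx

# Load the SMPL body model; "models/smpl" is a hypothetical local path to the model files.
smpl = smplx.create("models/smpl", model_type="smpl", gender="neutral")

pose = torch.zeros(1, 72)   # 72 pose parameters: root orientation + 23 joints x 3 axis-angle values
shape = torch.zeros(1, 10)  # 10 shape parameters (identity-specific deformations)

output = smpl(global_orient=pose[:, :3],  # first 3 values: global root rotation
              body_pose=pose[:, 3:],      # remaining 69 values: per-joint rotations
              betas=shape)
vertices = output.vertices  # (1, 6890, 3) posed mesh vertices
joints = output.joints      # corresponding 3D joint locations

Changing only the pose parameters over time animates the action, while resampling the shape parameters changes the identity of the rendered person.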
We hypothesize that a human action can be captured by the sequence of pose parameters, and that the shape parameters are largely irrelevant (note, this may not necessarily be true for human-object interaction categories). Given reliable 3D pose sequences from action recognition video datasets, we can transfer the associated action labels to synthetic videos. We use the recent method of Kanazawa et al. [28], namely human mesh and motion recovery (HMMR), unless stated otherwise. HMMR extends the single-image reconstruction method HMR [27] to video with a multi-frame CNN that takes into account a temporal neighborhood around a video frame. HMMR learns a temporal representation for human dynamics by incorporating large-scale 2D pseudo-ground truth poses for in-the-wild videos. It uses PoseFlow [92] and AlphaPose [13] for multi-person 2D pose estimation and tracking as a pre-processing step. Each person crop is then given as input to the CNN for estimating the pose and shape, as well as the weak-perspective camera parameters. We refer the reader to [28] for more details. We choose this method for its robustness on in-the-wild videos, its ability to capture multiple people, and the smoothness of the recovered motion, which are important for generalization from synthetic videos to real ones. Figure 1 shows sample video frames together with the synthetically animated 3D poses. We also experiment with the more recent motion estimation method, VIBE [31], and show that improvements in motion estimation proportionally affect the action recognition performance in our pipeline. Note that we only use the pose parameters from HMMR or VIBE, and randomly change the shape parameters, camera parameters, and other factors. Next, we present the augmentations in our synthetic data generation.

SURREACT dataset components
In this section, we give details on our synthetic dataset, SURREACT (Synthetic hUmans foR REal ACTions). We follow [83] and render 3D SMPL sequences with randomized cloth textures, lighting, and body shapes. We animate the body model with our automatically extracted pose dynamics as described in the previous section. We explore various motion augmentation techniques to increase intra-class diversity in our training videos. We incorporate multi-person videos, which are especially important for two-person interaction categories. We also systematically sample from 8 viewpoints around a circle to perform controlled experiments. Different augmentations are illustrated in Figure 2 for a sample synthetic frame. Visualizations from SURREACT are further provided in Figure 3.
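The randomized generation components can be summarized by the per-video configuration sketched below; the dictionary structure, parameter names, and the lighting variable are illustrative assumptions, while the azimuths and camera ranges correspond to the values used in our renderings (see the paragraphs below).

import random

AZIMUTHS = list(range(0, 360, 45))  # 8 systematic viewpoints: 0°:45°:360°

def sample_video_config(pose_sequence, action_label, cloth_textures, backgrounds, azimuth):
    # One randomized configuration per rendered video.
    return {
        "action": action_label,                                 # label transferred from the real video
        "poses": pose_sequence,                                 # SMPL pose parameters from Section 3.1
        "shape": [random.gauss(0.0, 1.0) for _ in range(10)],  # random body shape
        "cloth": random.choice(cloth_textures),                 # random clothing texture
        "background": random.choice(backgrounds),               # random background image
        "azimuth": azimuth,                                     # one of the 8 views
        "cam_dist": random.uniform(4.0, 6.0),                   # camera distance in meters
        "cam_height": random.uniform(-1.0, 3.0),                # camera height in meters
        "light": random.uniform(0.5, 1.0),                      # illustrative lighting strength
    }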
Each generated video has automatic ground truth for 3D joint locations, part segmentation, optical flow, and SMPL body [50] parameters, as well as an action label, which we use for training a video-based 3D CNN for action classification. We use other ground truth modalities as input to action recognition in oracle experiments (see Table A.5). We further use the optical flow ground truth to train a flow estimator, and use the segmentation to randomly augment the background pixels in some experiments.

(Fig. 2: we modify the joint angles with additive noise on the pose parameters for motion augmentation, systematically change the camera position to create viewpoint diversity, and sample from a large set of body shape parameters, backgrounds, and clothing to randomize appearances.)
Our new SURREACT dataset differs from the SURREAL dataset [83] mainly by providing action labels, exploring motion augmentation, and by using automatically extracted motion sequences instead of MoCap recordings [1]. Moreover, Varol et al. [83] do not exploit the temporal aspect of their dataset, but only train CNNs with single-image input. We further employ multi-person videos and a systematic viewpoint distribution.
Motion augmentation. Automatic extraction of 3D sequences from 2D videos poses an additional challenge in our dataset compared to clean high-quality MoCap sequences. To reduce the jitter, we temporally smooth the estimated SMPL pose parameters by weighted linear averaging. SMPL poses are represented as axis-angle rotations between joints. We convert them into quaternions when we apply linear operations, then normalize each quaternion to have a unit norm, before converting back to axis-angles. Even with this processing, the motions may remain noisy, which is inevitable given that monocular 3D motion estimation is a difficult task on its own. Our findings interestingly suggest that the synthetic human videos are still beneficial even when the motions are noisy.
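A minimal sketch of this smoothing step is given below, assuming SciPy for the rotation conversions; the window size and weights are illustrative assumptions.

import numpy as np
from scipy.spatial.transform import Rotation as R

def smooth_pose_sequence(axis_angles, weights=(0.25, 0.5, 0.25)):
    # axis_angles: (T, 24, 3) per-frame SMPL joint rotations in axis-angle form.
    T, J, _ = axis_angles.shape
    quats = R.from_rotvec(axis_angles.reshape(-1, 3)).as_quat().reshape(T, J, 4)
    # Fix quaternion sign flips so that neighbouring frames can be averaged linearly.
    for t in range(1, T):
        flip = (quats[t] * quats[t - 1]).sum(-1) < 0
        quats[t, flip] *= -1
    smoothed = quats.copy()
    for t in range(1, T - 1):  # weighted linear averaging over a 3-frame window
        smoothed[t] = weights[0] * quats[t - 1] + weights[1] * quats[t] + weights[2] * quats[t + 1]
    smoothed /= np.linalg.norm(smoothed, axis=-1, keepdims=True)  # renormalize to unit quaternions
    return R.from_quat(smoothed.reshape(-1, 4)).as_rotvec().reshape(T, J, 3)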
To increase motion diversity, we further perturb the pose parameters with various augmentations. Specifically, we use a video-level additive noise on the quaternions of each body joint to slightly change the poses, as an intra-individual augmentation. We also experiment with an inter-individual augmentation by interpolating between motion sequences of the same action class. Given a pair of sequences from two individuals, we first align them with dynamic time warping [68], then we linearly interpolate the quaternions of the time-aligned sequences to generate a new sequence, which we refer to as interpolation. A visual explanation of this process can be found in the Appendix (Figure A.1).

Multi-person. We use the 2D pose information from [13,92] to count the number of people in the real video. In the case of a single person, we center the person on the image and do not add 3D translation to the body, i.e., the person is centered independently for each frame. While such constant global positioning of the body loses information for some actions such as walking and jumping, we find that the translation estimate is too noisy to exploit this information and potentially increases the domain gap with the real data, where no such noise exists (see Appendix A). If there is more than one person, we insert additional body model(s) for rendering. We translate each person in the xy image plane. Note that we do not translate the person in full xyz space: we observe that the z component of the translation estimate is not reliable due to the depth ambiguity; therefore, the people are always centered at z = 0. More explanations about the reason for omitting the z component can be found in Appendix A. We temporally smooth the translations to reduce the noise. We subtract the mean of the translations across the video and across the people to roughly center all people in the frame; we therefore keep the relative distances between people, which is important for actions such as walking towards each other. This translation processing is sketched below.
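A minimal sketch of the multi-person translation processing described above; the array shapes and the smoothing kernel are illustrative assumptions.

import numpy as np

def process_translations(trans, kernel=5):
    # trans: (P, T, 3) estimated root translations for P people over T frames.
    trans = trans.copy()
    trans[..., 2] = 0.0  # drop the unreliable z component (depth ambiguity)
    k = np.ones(kernel) / kernel
    for p in range(trans.shape[0]):          # temporal smoothing, per person and per axis
        for c in range(2):
            trans[p, :, c] = np.convolve(trans[p, :, c], k, mode="same")
    mean_xy = trans[..., :2].mean(axis=(0, 1))  # mean over people and frames
    trans[..., :2] -= mean_xy                   # roughly center everyone, keep relative offsets
    return trans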
Viewpoints. We systematically render each motion sequence 8 times by randomizing all other generation parameters at each view. In particular, we place the camera at {0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°} azimuth angles with respect to the origin, denoted as (0°:45°:360°) in our experiments. The distance of the camera from the origin and the height of the camera from the ground are randomly sampled from a predefined range: [4, 6] meters for the distance, [−1, 3] meters for the height. This can be adjusted according to the target test setting.
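A minimal sketch of this camera sampling is given below; the world-coordinate and look-at conventions are illustrative assumptions (the actual renderer is Blender-based, see Appendix A).

import numpy as np

def sample_camera(azimuth_deg, dist_range=(4.0, 6.0), height_range=(-1.0, 3.0)):
    dist = np.random.uniform(*dist_range)      # distance from the origin, in meters
    height = np.random.uniform(*height_range)  # height above the ground, in meters
    a = np.deg2rad(azimuth_deg)
    position = np.array([dist * np.sin(a), dist * np.cos(a), height])  # camera location on a circle
    forward = -position / np.linalg.norm(position)  # viewing direction towards the origin (the body)
    return position, forward

cameras = [sample_camera(az) for az in range(0, 360, 45)]  # 8 views: 0°:45°:360°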
Backgrounds. Since we have access to the target real dataset where we run pose estimation methods, we can extract background pixels directly from the training set of this dataset. We crop regions without the person to obtain static backgrounds for the NTU and UESTC datasets. We experimentally show the benefits of using the target dataset backgrounds in the Appendix (see Table A.6). For the Kinetics experiments, we render human bodies on top of unconstrained videos from non-overlapping action classes and show benefits over static backgrounds. Note that these background videos might also include human pixels.

Training 3D CNNs with non-uniform frames
Following the success of 3D CNNs for video recognition [6,18,79], we employ a spatio-temporal convolutional architecture that operates on multi-frame video inputs. Unless otherwise specified, our network architecture is 3D ResNet-50 [18] and its weights are randomly initialized (see Appendix B.4 for pretraining experiments).
To study the generalization capability of synthetic data across different input modalities, we train one CNN for RGB and another for optical flow as in [74]. We average the scores with equal weights when reporting the fusion.
We subsample fixed-sized inputs from videos to have a 16 × 256 × 256 spatio-temporal resolution, in terms of number of frames, width, and height, respectively. In the case of optical flow input, we map the RGB input to 15 × 64 × 64 dimensional flow estimates. To estimate flow, we train a two-stack hourglass architecture [57] with our synthetic data for flow estimation on 2 consecutive frames. We refer the reader to the Appendix for qualitative flow results (Figures A.2 and A.3).

Non-uniform frame sampling. We adopt a different frame sampling strategy than most works [6,15,18] in the context of 3D CNNs. Instead of uniformly sampling (at a fixed frame rate) a video clip with consecutive frames, we randomly sample frames across time while keeping their temporal order, which we refer to as non-uniform sampling. Although recent works explore multiple temporal resolutions, e.g., by regularly sampling at two different frame rates [15], or randomly selecting a frame rate [96], the sampled frames are equidistant from each other. TSN [86] and ECO [99] employ a hybrid strategy by regularly sampling temporal segments and randomly sampling a frame from each segment, which is a more restricted special case of our strategy. Moreover, TSN uses a 2D CNN without temporal modelling.

(Fig. 4: compared to the commonly adopted consecutive setting, which uniformly samples with a fixed frame rate, non-uniform sampling has random skips in time, allowing speed augmentations and long-term context.)
ECO [99] also computes 2D convolutional features on each frame, which are stacked as input to a 3D CNN only at the end of the network. None of these works provides controlled experiments to quantify the effect of their sampling strategy. The concurrent work of [7] presents an experimental analysis comparing dense consecutive sampling with the hybrid sampling of TSN. Figure 4 compares consecutive sampling with our non-uniform sampling. In our experiments, we report results for both and show improvements for the latter. Our videos are temporally trimmed around the action; therefore, each video is short, i.e., it spans several seconds. During training, we randomly sample 16 video frames as a fixed-sized input to the 3D CNN. Thus, the convolutional kernels become speed-invariant to some degree. This can be seen as a data augmentation technique, as well as a way to capture long-term cues.
At test time, we sample several 16-frame clips and average the softmax scores. If we test the uniform case, we sample non-overlapping consecutive clips with a sliding window. For the non-uniform case, we randomly sample as many non-uniform clips as the number of sliding windows in the uniform case. In other words, the number of sampled clips is proportional to the video length. More precisely, let T be the number of frames in the entire test video, F be the number of input frames per clip, and S be the stride parameter. We sample N clips, where N = ⌊(T − F)/S⌋ + 1. In our case, F = 16 and S = 16. We apply a sliding window for the uniform case. For the non-uniform case, we sample N clips, where each clip is an ordered random (without replacement) 16-frame subset of the T frames. We observe that it is important to train and test with the same sampling scheme, and that keeping the temporal order is important. More details can be found in Appendix B.5.
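A minimal sketch of the two sampling schemes and of the test-time clip count is given below; the function names are illustrative.

import random

def consecutive_clip(T, F=16, start=None):
    # Uniform sampling: F consecutive frames at a fixed frame rate.
    start = random.randint(0, T - F) if start is None else start
    return list(range(start, start + F))

def nonuniform_clip(T, F=16):
    # Non-uniform sampling: F random frames, kept in temporal order.
    return sorted(random.sample(range(T), F))

def num_test_clips(T, F=16, S=16):
    return (T - F) // S + 1  # N = floor((T - F) / S) + 1

T = 84  # e.g., the average NTU video length
test_clips = [nonuniform_clip(T) for _ in range(num_test_clips(T))]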
Synth+Real. Since each real video is augmented multiple times (e.g., 8 times for 8 views), we have more synthetic data than real. When we add synthetic data to training, we balance the real and synthetic datasets such that at each epoch we randomly subsample from the synthetic videos to obtain an equal number of real and synthetic videos.
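A minimal sketch of this per-epoch balancing, assuming PyTorch-style datasets, is given below; the function and variable names are illustrative.

import random
from torch.utils.data import ConcatDataset, DataLoader, Subset

def make_epoch_loader(real_ds, synth_ds, batch_size=10):
    # Subsample the (larger) synthetic set so both domains contribute equally this epoch.
    idx = random.sample(range(len(synth_ds)), k=len(real_ds))
    mixed = ConcatDataset([real_ds, Subset(synth_ds, idx)])
    return DataLoader(mixed, batch_size=batch_size, shuffle=True)

Resampling the synthetic subset at every epoch keeps each epoch balanced while still exposing training to all synthetic videos over time.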
We minimize the cross-entropy loss using RMSprop [78] with mini-batches of size 10 and an initial learning rate of 10^-3 with a fixed schedule. Color augmentation is used for the RGB stream. Other implementation details are given in Appendix A.

Experiments
In this section, we start by presenting the action recognition datasets used in our experiments (Section 4.1). Next, we present extensive ablations for action recognition from unseen viewpoints (Section 4.2). Then, we compare our results to the state of the art for completeness (Section 4.3). Finally, we illustrate our approach on in-the-wild videos (Section 4.4).

Datasets and evaluation protocols
We briefly present the datasets used in this work, as well as the evaluation protocols employed.
NTU RGB+D dataset (NTU). This dataset [69] captures 60 actions with 3 synchronous cameras (see Figure 5). The large scale (56K videos) of the dataset allows training deep neural networks. Each sequence has 84 frames on average. The standard protocols [69] report accuracy for cross-view and cross-subject splits. The cross-view (CV) split considers the 0° and 90° views as training and the 45° view as test, and the same subjects appear both in training and test. For the cross-subject (CS) setting, 20 subjects are used for training, the remaining 20 for test, and all 3 views are seen at both training and test. We report on the standard protocols to be able to compare to the state of the art (see Table 8). However, we introduce a new protocol to make the task more challenging. From the cross-subject training split that has all 3 views, we take only the 0° viewpoint for training, and we test on the 0°, 45°, 90° views of the cross-subject test split. We call this protocol cross-view-subject (CVS). Our focus is mainly to improve on the unseen and distinct view of 90°.
UESTC RGB-D varying-view 3D action dataset (UESTC). UESTC is a recent dataset [24] that systematically collects 8 equally separated viewpoints that cover 360° around a person (see Figure 5). In total, the dataset has 118 subjects, 40 action categories, and 26500 videos of more than 200 frames each. This dataset allows studying actions from unusual views such as behind the person. We use the official protocol Cross View I (CV-I), suitable for our task, which trains with 1 viewpoint and tests with all other 7, for each view. The final performance is evaluated as the average across all tests. For completeness, we also report the Cross View II (CV-II) protocol that concentrates on multi-view training, i.e., training with even viewpoints (FV, V2, V4, V6) and testing with odd viewpoints (V1, V3, V5, V7), and vice versa.
One-shot Kinetics-15 dataset (Kinetics-15). Since we wish to formulate a one-shot scenario from in-the-wild Kinetics [29] videos, we need a pre-trained model to serve as feature extractor. We use a model pre-trained on Mini-Kinetics-200 [88], a subset of Kinetics-400. We define the novel classes from the remaining 200 categories which can be described by body motions. This procedure resulted in a 15-class subset of Kinetics-400: bending back, clapping, climbing a rope, exercising arm, hugging, jogging, jumpstyle dancing, krumping, push up, shaking hands, skipping rope, stretching arm, swinging legs, sweeping floor, wrestling. Note that many of the other categories, such as waiting in line, dining, holding snake, cannot be recognized solely by their body motions, but require additional contextual cues. From the 15 actions, we randomly sample 1 training video per class (see Figure 8 for example videos with their synthetic augmentations). The training set therefore consists of 15 videos. For testing, we report accuracy on all 725 validation videos from these 15 classes. The limitation of this protocol is that it is sensitive to the choice of the 15 training videos, e.g., if 3D motion estimation fails on one video, the model will not benefit from additional synthetic data for that class. Future work can consider multiple possible training sets (e.g., sampling videos where 3D pose estimation is confident) and report average performance.

(Table 1: training jointly on synthetic and real data substantially boosts the performance compared to real-only training on the NTU CVS protocol, especially on unseen views (e.g., 69.0% vs 53.6%). The improvement can be seen for both RGB and Flow streams, as well as the fusion. We note the marginal improvements with the addition of flow, unlike in other tasks where flow has been used to reduce the synthetic-real domain gap [11]. We render two different versions of the synthetic dataset using the HMMR and VIBE motion estimation methods, and observe improvements with VIBE. Moreover, training on synthetic videos alone is able to obtain 63.0% accuracy.)

Ablation Study
We first compare real-only (Real), synthetic-only (Synth), and mixed synthetic and real (Synth+Real) training. Next, we explore the effect of the motion estimation quality and inputting raw motion parameters as opposed to synthetic renderings. Then, we experiment with the different synthetic data generation parameters to analyze the effects of viewpoint and motion diversity.
In all cases, we evaluate our models on real test videos.
Real baselines. We start with our cross-view-subject protocol on NTU by training only with real data (see Table 2).

Non-uniform frame sampling. We note the consistent improvement of non-uniform frame sampling over uniform consecutive sampling in all settings in Table 2. Additional experiments on video frame sampling, such as for the optical flow stream, can be found in Appendix B.5. We use our non-uniform sampling strategy for both RGB and flow streams in the remainder of the experiments unless specified otherwise.
Synth+Real training. Next, we report the improvements obtained by synthetically increasing view diversity. We train the 60 action classes from NTU by combining the real 0° training data and the synthetic data augmented from real with 8 viewpoints, i.e., 0°:45°:360°. Table 1 compares the results of Real, Synth, and Synth+Real trainings for RGB and Flow streams, as well as their combination. The performance of the flow stream is generally lower than that of the RGB stream, possibly due to the fine-grained categories which cannot be distinguished with coarse motion fields. It is interesting to note that training only with synthetic data (Synth) reaches 63.0% accuracy on real 0° test data, which indicates a certain level of generalization capability from synthetic to real. Combining real and synthetic training videos (Synth+Real), the performance of the RGB stream increases from 53.6% to 69.0% compared to only real training (Real) on the challenging unseen 90° viewpoint. Note that the additional synthetic videos can be obtained 'for free', i.e., without extra annotation cost. We also confirm that even the noisy motion estimates are sufficient to obtain significant improvements, suggesting that the discriminative action information is still present in our synthetic data.

(Fig. 6: inputting raw motion parameters performs significantly worse for the 90° unseen viewpoint compared to synthetic renderings on the NTU CVS protocol. We compare various input representations with increasing view-independence (joint coordinates, SMPL pose parameters, SMPL pose parameters without the global rotation). Experiments are carried out with the SMPL model recovered by the RGB-based methods HMMR [28] and VIBE [31], and with depth-based Kinect joints. A 2D ResNet architecture is used for motion parameter inputs, similar to [30]. We also present an architecture study in Table 3. Note that significant further gains are possible when mixing the synthetic renderings with real videos. See text for interpretation.)
The advantage of having a controllable data generation procedure is to be able to analyze what components of the synthetic data are important. In the following, we examine a few of these aspects, such as quality of the motion estimation, input representation, amount of data, view diversity, and motion diversity. Additional results can be found in Appendix B.
Quality of the motion estimation: HMMR vs VIBE. 3D motion estimation from monocular videos has only recently demonstrated convincing performance on unconstrained videos, opening up the possibility to investigate our problem of action recognition with synthetic videos. One natural question is whether the progress in 3D motion estimation methods will improve the synthetic data. To this end, we compare two sets of synthetic data, keeping all the factors the same except the motion source: Synth-HMMR, extracted with HMMR [28], and Synth-VIBE, extracted with VIBE [31]. Table 1 presents the results. We observe consistent improvements with the more accurate pose estimation from VIBE over HMMR, suggesting that our proposed pipeline has great potential to further improve with the progress in 3D recovery.
Raw motion parameters as input. Another question is whether the motion estimation output, i.e., the body pose parameters, can be directly used as input to an action recognition model instead of going through synthetic renderings. We implement a simple 2D CNN architecture similar to [30] that inputs a 16-frame pose sequence in the form of 3D joint coordinates (24 joints for SMPL, 25 joints for Kinect) or 3D joint rotations (24 axis-angle parent-relative rotations for SMPL, or 23 without the global rotation). In particular, we use a ResNet-18 architecture [20]. We experiment with both HMMR and VIBE to use SMPL parameters as input, as well as the Kinect joints provided by the NTU dataset for comparison. Figure 6 reports the results of various pose representations against the performance of synthetic renderings for three test views. We make several observations: (i) removing viewpoint-dependent factors, e.g., using pose parameters over joints, degrades performance on the seen viewpoint, but consistently improves on unseen viewpoints; (ii) synthetic video renderings from all viewpoints significantly improve over raw motion parameters for the challenging unseen viewpoint; (iii) VIBE outperforms HMMR; (iv) both RGB-based motion estimation methods are competitive with the depth-based Kinect joints. We note the significant boost with renderings (45.3%) over pose parameters (29.0%) for the 90° test view, despite the same source of motion information for both. There are three main differences which can be potential reasons. First, the architectures 3D ResNet and 2D ResNet have different capacities. Second, motion estimation from non-frontal viewpoints can be challenging, negatively affecting the performance of pose-based methods, but not affecting the 3D ResNet (because pose estimation is not a required step). Third, the renderings have the advantage that standard data augmentation techniques on image pixels can be applied, unlike the pose parameters, which are not augmented. More importantly, the renderings have the advantage that they can be mixed with the real videos, which was shown to substantially improve the performance in Table 1.
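For illustration, a minimal sketch of how a pose sequence can be fed to the 2D ResNet is given below; the exact channel layout used in [30] and in our implementation may differ, so this is an assumption for illustration (joint coordinates are shown; rotation inputs work analogously).

import torch
import torchvision

def pose_clip_to_tensor(joints3d):
    # joints3d: (16, 24, 3) clip of 3D joint coordinates -> (3, 16, 24) image-like tensor.
    return torch.as_tensor(joints3d, dtype=torch.float32).permute(2, 0, 1)

model = torchvision.models.resnet18(num_classes=60)  # 2D ResNet-18 over the pose 'image', 60 NTU classes
clip = pose_clip_to_tensor(torch.randn(16, 24, 3))    # random example clip
logits = model(clip.unsqueeze(0))                     # (1, 60) class scores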
To explore the architectural capacity question, we study the pose-based action recognition model further and experiment with the recent ST-GCN model [89] that makes use of graph convolutions. For this experiment, we use VIBE pose estimates and compare ST-GCN with the 2D ResNet architecture in Table 3. Although we observe improvements with ST-GCN (29.0% vs 36.2%), the synthetic renderings provide significantly better generalization to the unseen 90° view (45.3%).
Amount of data. In the NTU CVS training split, we have about 220 sequences per action. We take subsets with {10, 30, 60, 100} sequences per action, and train the three scenarios Real, Synth, and Synth+Real for each subset. Figure 7 plots the performance versus the amount of data for these scenarios, for both RGB and Flow streams. We observe a consistent improvement from complementary synthetic training, especially for unseen viewpoints. We also see that, at a given number of sequences per action, it is more effective to use synthetic data. For example, on the 90° viewpoint, increasing the number of sequences from 100 to 220 in the real data results in only a 4.6% improvement (49.0% vs 53.6%, Real), while one can synthetically augment the existing 100 sequences per action and obtain 64.7% (Synth+Real) accuracy without spending extra annotation effort.
View diversity. We wish to confirm that the improvements presented so far are mainly due to the viewpoint variation in synthetic data. The "Synth(v=0°) + Real" plot in Figure 7 indicates that only the 0° viewpoint from synthetic data is used. In this case, we observe that the improvement is not consistent. Therefore, it is important to augment viewpoints to obtain improvements. Moreover, we experiment with having only ±45° or ±90° views in the synthetic-only training for 60 sequences per action. In Table 4, we observe that the test performance is higher when the synthetic training view matches the real test view. However, having all 8 viewpoints at training benefits all test views.
(Table 6: UESTC dataset Cross View I protocol: training on 1 viewpoint and testing on all the others. The plots on the right show individual performances for the RGB networks. The rows and columns of the matrices correspond to training and testing views, respectively. We obtain significant improvements over the state of the art, due to our non-uniform frame sampling and synthetic training.)

Motion diversity. Next, we investigate whether motions can be diversified and whether this is beneficial for synthetic training. There are very few attempts in this direction [10], since synthetic data has mainly been used for static images. Recently, [42] introduced interpolation between distinct poses to create new poses in synthetic data for training 3D pose estimation; however, its contribution over existing poses was not experimentally validated. In our case, we need to preserve the action information; therefore, we cannot generate unconstrained motions. Generating realistic motions is a challenging research problem on its own and is out of the scope of this paper. Here, we experiment with motion augmentation to increase diversity. As explained in Section 3.2, we generate new motion sequences by (i) interpolating between motion pairs of the same class, or (ii) adding noise to the pose parameters. Table 5 presents the results of this analysis when we train only with synthetic data and test on the NTU CVS protocol. We compare to the baseline where 10 motion sequences per action are rendered once per viewpoint (the first row). We render the same sequences without motion augmentation 6 times (the second row) and obtain a marginal improvement. On the other hand, having 60 real motion sequences per action significantly improves performance (last row) and is our upper bound for the motion augmentation experiments. This means that clothing, body shape, and lighting, i.e., appearance diversity, is not as important as motion diversity. We see that generating new sequences with interpolations improves over the baseline. Moreover, perturbing the joint rotations across the video with additive noise is simple and effective, with a performance increase of about 5% (26.2% vs 31.5%) over rendering 6 times without motion augmentation. To justify the video-level noise (i.e., one value added to all frames), Table 5 also reports frame-level noise and a hybrid version where we independently sample a noise value every 25 frames and interpolate for the frames in between. These renderings qualitatively remain very noisy, reducing the performance in return.
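A minimal sketch contrasting the video-level and frame-level noise variants is given below; the noise scale is an illustrative assumption.

import numpy as np

def add_quaternion_noise(quats, sigma=0.02, per_frame=False):
    # quats: (T, 24, 4) unit quaternions for one motion sequence.
    T, J, _ = quats.shape
    if per_frame:
        noise = np.random.normal(0.0, sigma, size=(T, J, 4))  # independent noise per frame (jittery)
    else:
        noise = np.random.normal(0.0, sigma, size=(1, J, 4))  # one offset shared by the whole video
    noisy = quats + noise
    return noisy / np.linalg.norm(noisy, axis=-1, keepdims=True)  # renormalize to unit quaternions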

Comparison with the state of the art
In the following, we employ the standard protocols for the UESTC and NTU datasets, and compare our performance with other works. Tables 6 and 7 compare our results to the state-of-the-art methods reported by [24] on the recently released UESTC dataset, on the CV-I and CV-II protocols. To augment the UESTC dataset, we use the VIBE motion estimation method. We outperform the RGB-based methods JOULE [23] and 3D ResNeXt-101 [18] by a large margin, even though we use a less deep 3D ResNet-50 architecture. We note that we have trained the ResNeXt-101 architecture [18] with our implementation and obtained better results than with our ResNet-50 architecture (45.2% vs 36.1% on CV-I, 82.5% vs 76.1% on CV-II). This contradicts the results reported in [24]. A first improvement can be attributed to our non-uniform frame sampling strategy; therefore, we report our uniform real baseline as well. A significant performance boost is then obtained by having a mixture of synthetic and real training data. Using only RGB input, we obtain a 17.0% improvement on the challenging CV-I protocol over real data (66.4 vs 49.4). Using both RGB and flow, we obtain a 44.1% improvement over the state of the art (76.1 vs 32.0). We also report on the even/odd test splits of the CV-II protocol that have access to multi-view training data. The synthetic data again shows benefits over the real baselines. Compared to NTU, which contains object interactions that we do not simulate, the UESTC dataset focuses more on anatomical movements, such as body exercises. We believe that these results convincingly demonstrate the generalization capability of our efficient synthetic data generation method to real body motion videos.
In Table 8, we compare our results to the state-of-the-art methods on the standard NTU splits. The synthetic videos are generated using the HMMR motion estimation method. Our results on both splits achieve state-of-the-art performance with only the RGB modality. In comparison, [4,52,98] use pose information during training, and [51] uses other modalities from Kinect, such as depth and skeleton, during training. Similar to us, [84] uses a two-stream approach. Our non-uniform sampling boosts the performance. We have moderate gains with the synthetic data for both RGB and flow streams, as the real training set is already large and similar to the test set.

One-shot training
We test the limits of our approach on unconstrained videos of the Kinetics-15 dataset. These videos are challenging for several reasons. First, the 3D human motion estimation often fails due to complex conditions such as motion blur, low resolution, occlusion, crowded scenes, and fast motion. Second, there exist cues about the action context that are difficult to simulate, such as object interactions, or a bias towards certain clothing or environments for certain actions. Assuming that body motions alone, even when noisy, provide discriminative information for actions, we augment the 15 training videos of the one-shot Kinetics-15 subset synthetically using HMMR (see Figure 8), rendering at 5 viewpoints (0°, 30°, 45°, 315°, 330°).
We use a pre-trained feature extractor model and only train a linear layer from the features to the 15 classes. We observe over-fitting with higher-capacity models due to limited one-shot training data. We experiment with two pre-trained models, obtained from [9]: RGB and flow. The models follow the 3D ResNeXt-101 architecture from [18] and are pre-trained on Mini-Kinetics-200 categories with 16 × 112 × 112 resolution with consecutive frame sampling.
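A minimal sketch of this one-shot classifier, assuming PyTorch, is given below; the 2048-dimensional feature size and the training-step function are illustrative assumptions.

import torch
import torch.nn as nn

feat_dim, num_classes = 2048, 15
classifier = nn.Linear(feat_dim, num_classes)  # the only trainable layer
optimizer = torch.optim.RMSprop(classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(features, labels):
    # `features` are clip descriptors pre-extracted with the frozen 3D ResNeXt-101 backbone.
    logits = classifier(features)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()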
In Table 9 (top), we first provide simple baselines: nearest neighbor with pre-trained features is slightly above random chance (8.6% vs 6.7% for RGB). Table 9 (bottom) shows training linear layers. Using only synthetic data obtains poor performance (9.4%). Training only with real data on the other hand obtains 26.2%, which is our baseline performance. We obtain ∼6% improvement by adding synthetic data. We also experiment with static background images from the LSUN dataset [90] and note the importance of realistic noisy backgrounds for generalization to in-the-wild videos.

Conclusions
We presented an effective methodology for automatically augmenting action recognition datasets with synthetic videos. We explored the importance of different variations in the synthetic data, such as viewpoints and motions. Our analysis emphasizes the question on how to diversify motions within an action category. We obtain significant improvements for action recognition from unseen viewpoints and one-shot training. However, our approach is limited by the performance of the 3D pose estimation, which can fail in cluttered scenes. Possible future directions include action-conditioned generative models for motion sequences and simulation of contextual cues for action recognition.

APPENDIX
This appendix provides detailed explanations for several components of our approach (Section A). We also report complementary results for synthetic training and for our non-uniform frame sampling strategy (Section B).

A Additional details
SURREACT rendering. We build on the implementation of [83] and use the Blender software. We add support for multi-person images, for using estimated motion inputs, for systematic viewpoint rendering, and different sources for background images. We use the cloth textures released by [83], i.e., 361/90 female, 382/96 male textures for training/test splits, respectively. The resolution of the video frames is similarly 320x240 pixels. For background images, we used 21567/8790 train/test images extracted from NTU videos, and 23034/23038 train/test images extracted from UESTC videos, by sampling a region outside of the person bounding boxes. The rendering code takes approximately 6 seconds per frame, for saving RGB, body-part segmentation and optical flow data. We parallelize the rendering over hundreds of CPUs to accelerate the data generation.
Motion sequence interpolation. As explained in Section 3.2 of the main paper, we explore creating new sequences by interpolating pairs of motions from the same action category. Here, we visually illustrate this process. Figure A.1 shows two sequences of sitting down that are first aligned with dynamic time warping, and then linearly interpolated. We only experiment with equal weights when interpolating (i.e. 0.5), but one can sample different weights when increasing the number of sequences further.
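A minimal sketch of this alignment and interpolation is given below; the DTW cost on flattened pose vectors and the linear quaternion blending are illustrative choices.

import numpy as np

def dtw_path(A, B):
    # A: (Ta, D), B: (Tb, D) flattened per-frame pose vectors; returns aligned frame pairs.
    Ta, Tb = len(A), len(B)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(A[i - 1] - B[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], Ta, Tb
    while i > 0 and j > 0:  # backtrack to recover the frame correspondences
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def interpolate_sequences(quats_a, quats_b, w=0.5):
    # quats_a: (Ta, 24, 4), quats_b: (Tb, 24, 4) unit quaternions; returns a new time-aligned sequence.
    A = quats_a.reshape(len(quats_a), -1)
    B = quats_b.reshape(len(quats_b), -1)
    out = []
    for i, j in dtw_path(A, B):
        q = (1 - w) * quats_a[i] + w * quats_b[j]  # linear blend of time-aligned frames
        out.append(q / np.linalg.norm(q, axis=-1, keepdims=True))
    return np.stack(out)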
3D translation in SURREACT. In Section 3.2 of the main paper, we explained that we translate the people in the xy image plane only when there are multiple people in the scene. HMMR [28] estimates the weak-perspective camera scale, jointly with the body pose and shape. We note that obtaining 3D translation of the person in the camera coordinates is an ambiguous problem. It requires the size of the person to be known. This becomes more challenging in the case of multi-person videos.
HMMR relies on 2D pose estimation to locate the bounding box of the person, which then becomes the input to a CNN. The CNN outputs a scale estimation s_b together with the [x_b, y_b] normalized image coordinates of the person center with respect to the bounding box. We first convert these values to be with respect to the original uncropped image: s and [x, y]. We can recover an approximate value for the z coordinate of the person center by assuming a fixed focal length; however, this depth estimate is noisy, whereas the x and y values are more reliable. We therefore assume that the person is always centered at z = 0 and apply the translation only in the xy plane. We observe that, due to the noisy 2D person detections, the estimated translation is noisy even in the xy image plane, leading to lower generalization performance on real data when we train only with synthetic data. We validate this empirically in Table A.1. We render multiple versions of the synthetic dataset with 10 motion sequences per action, each rendered from 8 viewpoints. We train only with this synthetic data and evaluate on the real NTU CVS protocol. Including multiple people improves performance (first and second rows), mainly because 11 out of 60 action categories in NTU are two-person interactions. Figure A.6 also shows the confusion matrix of training only with single-person data, resulting in the confusion of the interaction categories. Dropping the z component from the translation further improves (second and third rows). We also experiment with no translation if there is a single person, and xy translation only for the multi-person case (fourth row), which has the best generalization performance. This is unintuitive, since some actions such as jumping are not realistic when the vertical translation is not simulated. This indicates that the translation estimations from the real data need further improvement before being incorporated in the synthetic data. Our 3D CNN is otherwise sensitive to the temporal jitter induced by the noisy translations of the people.
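A minimal sketch of the depth implied by a weak-perspective camera under a fixed focal length is given below; the specific conversion used in the released HMMR code may differ, so the formula and constants are assumptions for illustration.

def approx_depth(s, f=500.0, img_size=224.0):
    # The weak-perspective scale s implies an approximate distance to the camera
    # once a focal length f (in pixels) and the crop resolution are fixed.
    return 2.0 * f / (s * img_size)

# In practice we discard this z value (it is unreliable due to the depth/size ambiguity)
# and keep only the xy translation, with z = 0.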
Flow estimation. We train our own optical flow estimation CNN, which we use to compute the flow in an online fashion during action classification training. In other words, we do not require pre-processing the videos for training. To do so, we use a light-weight stacked hourglass architecture [57] with two stacks. The input and output have 256 × 256 and 64 × 64 spatial resolution, respectively. The input consists of 2 consecutive RGB frames of a video; the output is the downsampled optical flow ground truth. We train with the mean squared error between the estimated and ground truth flow values. We obtain the ground truth from our synthetic SURREACT dataset. Qualitative results of our optical flow estimates can be seen in Figures A.2 and A.3 on real and synthetic images, respectively. When we compute the flow on-the-fly for action recognition, we loop over the 16-frame RGB input to compute the flow between every 2 frames and obtain a 15-frame flow field as input to the action classification network.
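A minimal sketch of this on-the-fly computation is given below, assuming a flow_net that maps two stacked RGB frames to a 64 × 64 flow field (as our hourglass does); names and shapes are illustrative.

import torch

def clip_to_flow(rgb_clip, flow_net):
    # rgb_clip: (16, 3, 256, 256) RGB frames -> (15, 2, 64, 64) flow input for the flow stream.
    flows = []
    with torch.no_grad():
        for t in range(rgb_clip.shape[0] - 1):
            pair = torch.cat([rgb_clip[t], rgb_clip[t + 1]], dim=0)  # stack the two frames: (6, 256, 256)
            flows.append(flow_net(pair.unsqueeze(0))[0])             # estimated flow: (2, 64, 64)
    return torch.stack(flows)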
Training details. We give additional details to Section 3.3 of the main paper on the action classification training. We train our networks for 50 epochs with an initial learning rate of 10^-3, which is decreased twice by a factor of 10^-1, starting at epoch 40.

B Additional analyses
We analyze further the synthetic-only training (Section B.1) and synthetic+real training (Section B.2). We define a synthetic test set and report the results of the models in the main paper also on this test set. We present additional ablations. We report the confusion matrix on the synthetic test set, as well as on the real test set, which allows us to gain insights about which action categories can be represented better synthetically. Finally, we explore the proposed non-uniform sampling more in Section B.5.

B.1 Synthetic-only training
Here, we define a synthetic test set based on the NTU actions, and perform additional ablations on our synthetic data such as different input modalities beyond RGB and flow, effect of backgrounds, effect of further camera augmentations, and confusion matrix analysis.
Synthetic test set. Similar to SURREAL [83], we separate the assets such as cloth textures, body shapes, and backgrounds into train and test splits, which allows us to validate our experiments also on a synthetic test set. Here, we use one sequence per action from the real 0° test set to generate a small synthetic test set, i.e., 60 motion sequences in total, rendered for the 8 viewpoints, using the test set assets. We report the performance of our models from Tables 4 and 5 of the main paper on this set. We see that, even when ground truth is used, optical flow performs worse than RGB, indicating the difficulty of distinguishing fine-grained actions only with flow fields. Body-part segmentation, on the other hand, outperforms the other modalities due to providing precise locations for each body part and an abstraction which reduces the gap between the training and test splits. In other words, body-part segmentation is independent of clothing, lighting, and background effects, and only contains motion and body shape information. This result highlights that we can improve action recognition by improving body part segmentation, as in [98].
Effect of backgrounds. As explained in Section 3.2 of the main paper, we use 2D background images from the target action recognition domain in our synthetic dataset. We perform an experiment to check whether this helps on the NTU CVS setup. The NTU dataset is recorded in a lab environment, and therefore has specific background statistics. We train models by replacing the background pixels of our synthetic videos randomly by LSUN [83,90] images or by the original NTU images outside the person bounding boxes. Table A.6 summarizes the results. Using random NTU backgrounds outperforms using random LSUN backgrounds. However, we note that the process of using the segmentation mask creates some unrealistic artifacts around the person, which might contribute to the performance degradation. We therefore use the fixed backgrounds from the original renderings in the rest of the experiments.
Effect of camera height/distance augmentations. As stated in Section 3.2 of the main paper, we randomize the height and the distance of the camera to increase the viewpoint diversity within a certain azimuth rotation. We evaluate the importance of this with a controlled experiment in Table A.7. We render two versions of the synthetic training set with 10 sequences per action from 8 viewpoints. The first one has a fixed distance and height at 5 meters and 1 meter, respectively. In the second one, we randomly sample from [4, 6] meters for the distance, and [−1, 3] meters for the height. We see that the generalization to the real NTU CVS dataset is improved with increased randomness in the synthetic training. Visuals corresponding to the pre-defined range can be found in Figure A.4.
Confusion matrices. We analyze two confusion matrices in Figure A.5.

B.2 Synthetic+Real training
Amount of additional synthetic data. In Figure 7 of the main paper, we plotted the performance against the number of action sequences in the training set for both synthetic and real datasets. Here, we also report the Synth+Real training performance when the Real data is fixed and uses all the action sequences available (i.e., 220 sequences per action), and the Synth data is gradually increased.

Synth+Real training strategies. In all our experiments, we combine the training sets of synthetic and real data to train jointly on both datasets, which we refer to as Synth+Real.
Here, we investigate whether a different strategy, such as using synthetic data for pre-training (as in Varol et al. [83]), would be more effective. In Table A.9, we present several variations of training strategies. We conclude that our Synth+Real strategy is simple yet effective, while marginal gains can be obtained by continuing with fine-tuning only on Real data.

B.3 Performance breakdown for object-related actions
While the NTU dataset is mainly targeted for skeleton-based action recognition, many actions involve object interactions. In Table A.10, we analyze the performance breakdown into action categories with and without objects. We notice that the object-related actions have lower performance than their body-only counterparts even when trained with Real data. The gap is higher when only synthetic training is used, since we simulate only humans, without objects. We note that this is still marginal compared to the boost we gain from synthetic data (69.0% in Table 1).

B.4 Pretraining on Kinetics
Throughout the paper, the networks for NTU training are randomly initialized (i.e., trained from scratch). Here, we investigate whether there is any gain from Kinetics [29] pretraining. Training only a linear layer on frozen features, as opposed to end-to-end (e2e) finetuning, is also suboptimal.

B.5 Non-uniform frame sampling
In this section, we explore the proposed frame sampling strategy further. First, we confirm that the benefits of non-uniform sampling also apply to the flow stream. Since flow is estimated online during training, we can compute flow between any two frames. Note that the flow estimation method is learned on 2 consecutive frames; therefore, it produces noisy estimates for large displacements. However, even with this noise, in Table A.12 we demonstrate advantages of non-uniform sampling over consecutive sampling for the flow stream.
Next, we present our experiments about the testing modes, as mentioned in Section 3.3 of the main paper. Table A.13 suggests that the training and testing modes should be the same for both uniform and non-uniform samplings. The convolutional filters adapt to certain training statistics, which should be preserved at test time.
We then investigate the importance of the frame order when we randomly sample non-uniformly. We preserve the temporal order in all our experiments, except in Table A.14, where we experiment with a shuffled order. In this case, we observe a significant performance drop, which can be explained by the confusion matrix in Figure A.7. Action classes such as wearing and taking off are heavily confused when the order is not preserved. This experiment allows detecting action categories that are temporally symmetric [61]. We also observe that the ordered model fails when tested in non-ordered mode, which indicates that the convolutional kernels become highly order-aware. Finally, we experiment with other random frame sampling schemes besides the uniform baseline and the random non-uniform sampling, training and testing on the real NTU CVS split.

(Fig. A.6: confusion matrix when training only with synthetic data where we insert a single person per video. When trained with this data, the two-person interaction categories (last 11 classes) are mostly misclassified on the real NTU CVS 0° view test data. The confusions suggest that it is important to model multi-person cases in the synthetic data.)

(Fig. A.7 and Table A.14: training and testing on the real NTU CVS 0° view split with non-uniform, non-ordered sampling. The classes that require temporal order to be distinguished are confused as expected, e.g., wear a shoe vs take off a shoe, wear on glasses vs take off glasses, put on a hat vs take off a hat, sitting down vs standing up.)