1 Introduction

Learning human action representations from RGB video data has been widely studied. Recent advances in convolutional neural networks (CNNs) (LeCun et al. 1989) have shown excellent performance (Carreira and Zisserman 2017; Feichtenhofer et al. 2019, 2016; Hara et al. 2018; Lin et al. 2019; Varol et al. 2018; Wang et al. 2016) on benchmark datasets such as UCF101 (Soomro et al. 2012). However, the success of CNNs relies heavily on the availability of large-scale training data, which is not always given. To address the lack of training data, several works explore the use of complementary synthetic data for a range of computer vision tasks such as optical flow estimation, segmentation, and human body and hand pose estimation (Dosovitskiy et al. 2015; Shotton et al. 2011; Su et al. 2015; Varol et al. 2017; Zimmermann and Brox 2017). In this work, we raise the question of how to synthesize videos for action recognition when real data is limited, e.g., when only one viewpoint or one example per class (one-shot) is available at training.

Imagine a surveillance or ambient assisted living system, where a dataset has already been collected for a set of actions from a certain camera. Placing a new camera in the environment at a new viewpoint would require re-annotating data, because the appearance of an action changes drastically when viewed from a different viewpoint (Junejo et al. 2011; Liu et al. 2011; Zheng et al. 2016). In fact, we observe that state-of-the-art action recognition networks fail drastically when trained and tested on distinct viewpoints. Specifically, we train the model of Hara et al. (2018) on videos from the benchmark NTU RGB+D dataset (Shahroudy et al. 2016) where people are facing the camera. When we test this network on other front-view (0\(^\circ \)) videos, we obtain \(\sim \)80% accuracy. When we test with side-view (90\(^\circ \)) videos, the performance drops to \(\sim \)40% (see Sect. 4). This motivates us to study action recognition from novel viewpoints.

Existing methods addressing cross-view action recognition have mainly been evaluated in unchallenging setups, e.g., with the same subjects and similar viewpoints in the training and test splits (Shahroudy et al. 2016). We introduce and study a more challenging protocol with only one viewpoint available at training. Recent methods that assume multi-view training data (Li et al. 2018a, b; Wang et al. 2018) also become inapplicable in this setting.

A naive way to achieve generalization is to collect data from all views, for all possible conditions, but this is impractical due to combinatorial explosion (Yuille et al. 2018). Instead, we augment the existing real data synthetically to increase the diversity in terms of viewpoints, appearance, and motions. Synthetic humans are relatively easy to render for tasks such as pose estimation, because arbitrary motion capture (MoCap) resources can be used (Shotton et al. 2011; Varol et al. 2017). However, action classification requires certain motion patterns and semantics, and it is challenging to generate synthetic data with action labels (De Souza et al. 2017). Typical MoCap datasets (CMU Mocap Database), targeted at pose diversity, are not suitable for action recognition due to the lack of clean action annotations. Even if one collects a MoCap dataset, it is still limited to a pre-defined set of categories.

In this work, we propose a new, efficient and scalable approach for generating synthetic videos with action labels from the target set of categories. We employ a 3D human motion estimation method, such as HMMR (Kanazawa et al. 2019) or VIBE (Kocabas et al. 2020), that automatically extracts the 3D human dynamics from a single-view RGB video. The resulting sequence of SMPL body (Loper et al. 2015) pose parameters is then combined with other randomized generation components (e.g. viewpoint, clothing) to render diverse complementary training data with action annotations. Figure 1 presents an overview of our pipeline. We demonstrate the advantages of such data when training spatio-temporal CNN models for (1) action recognition from unseen viewpoints and (2) training with one-shot real data. We boost performance on unseen viewpoints from 53.6 to 69.0% on the NTU dataset, and from 49.4 to 66.4% on the UESTC dataset, by augmenting limited real training data with our proposed SURREACT dataset. Furthermore, we present an in-depth analysis of the importance of action-relevant augmentations such as the diversity of motions and viewpoints, as well as our non-uniform frame sampling strategy, which substantially improves action recognition performance. Our code and data will be available at the project page.

2 Related Work

Human action recognition is a well-established research field. For a broad review of the literature on action recognition, see the recent survey of Kong and Fu (2018). Here, we focus on relevant works on synthetic data, cross-view action recognition, and briefly on 3D human shape estimation.

Synthetic Humans. Simulating human motion dates back to the 1980s; Badler et al. (1993) provide an extensive overview of early approaches. More recently, synthetic images of people have been used to train visual models for 2D/3D body pose and shape estimation (Chen et al. 2016; Ghezelghieh et al. 2016; Liu et al. 2019a; Pishchulin et al. 2012; Shotton et al. 2011; Varol et al. 2018), part segmentation (Shotton et al. 2011; Varol et al. 2017), depth estimation (Varol et al. 2017), multi-person pose estimation (Hoffmann et al. 2019), pedestrian detection (Marin et al. 2010; Pishchulin et al. 2012), person re-identification (Qian et al. 2018), hand pose estimation (Hasson et al. 2019; Zimmermann and Brox 2017), and face recognition (Kortylewski et al. 2018; Masi et al. 2019). Synthetic datasets built for these tasks, such as the recent SURREAL dataset (Varol et al. 2017), however, do not provide action labels.

Among previous works that focus on synthetic human data, very few tackle action recognition (De Souza et al. 2017; Liu et al. 2019b; Rahmani and Mian 2016). Synthetic 2D human pose sequences (Lv and Nevatia 2007) and synthetic point trajectories (Rahmani and Mian 2015; Rahmani et al. 2018; Jingtian et al. 2018) have been used for view-invariant action recognition. However, RGB-based synthetic training for action recognition is relatively new, with De Souza et al. (2017) being one of the first attempts. De Souza et al. (2017) manually define 35 action classes and jointly estimate real and synthetic categories in a multi-task setting. However, their categories are not easily scalable and do not necessarily relate to the target set of classes. Unlike De Souza et al. (2017), we automatically extract motion sequences from real data, making the method flexible for new categories. Recently, Puig et al. (2018) generated the VirtualHome dataset, a simulation environment with programmatically defined synthetic activities collected through crowd-sourcing. In contrast to our work, the focus of Puig et al. (2018) is not generalization to real data.

Most relevant to ours, Liu et al. (2019b) generate synthetic training images to achieve better performance on unseen viewpoints. Their work extends Rahmani and Mian (2016) by using RGB-D as input instead of depth only. Both works formulate a frame-based pose classification problem on their synthetic data, which they then use as features for action recognition. These features are not necessarily discriminative for the target action categories. In contrast to this direction, we explicitly assign an action label to synthetic videos and define the supervision directly on action classification.

Cross-View Action Recognition. Due to the difficulty of building multi-view action recognition datasets, the standard benchmarks have been recorded in controlled environments. RGB-D datasets such as IXMAS (Weinland et al. 2007), UWA3D II (Rahmani et al. 2016) and N-UCLA (Wang et al. 2014) were the standard benchmarks until the release of the large-scale NTU RGB+D dataset (Shahroudy et al. 2016). The size of NTU allows training deep neural networks, unlike previous datasets. Very recently, Ji et al. (2018) collected the first large-scale dataset, UESTC, with a 360\(^\circ \) coverage around the performer, although still in a lab setting.

Since multi-view action datasets are typically captured with depth sensing devices, such as Kinect, they also provide an accurate estimate of the 3D skeleton. Skeleton-based cross-view action recognition therefore received a lot of attention in the past decade (Ke et al. 2017; Liu et al. 2016, 2017a, b; Zhang et al. 2017). Variants of LSTMs (Hochreiter and Schmidhuber 1997) have been widely used (Liu et al. 2016, 2017a; Shahroudy et al. 2016). Recently, spatio-temporal skeletons were represented as images (Ke et al. 2017) or higher dimensional objects (Liu et al. 2017b), where standard CNN architectures were applied.

RGB-based cross-view action recognition is comparatively less studied. Transforming RGB features to be view-invariant is not as trivial as transforming 3D skeletons. Early work on transferring appearance features from the source view to the target view explored the use of maximum margin clustering to build a joint codebook for temporally synchronous videos (Farhadi and Tabrizi 2008). Following this approach, several other works focused on building global codebooks to extract view-invariant representations (Kong et al. 2017; Liu et al. 2019c; Rahmani et al. 2018; Zheng and Jiang 2013; Zheng et al. 2016). Recently, end-to-end approaches used human pose information as guidance for action recognition (Baradel et al. 2017; Liu and Yuan 2018; Luvizon et al. 2018; Zolfaghari et al. 2017). Li et al. (2018a) formulated an adversarial view classifier to achieve view-invariance. Wang et al. (2018) proposed to fuse view-specific features from a multi-branch CNN. Such approaches cannot handle single-view training (Li et al. 2018a; Wang et al. 2018). Our method differs from these works by compensating for the lack of view diversity with synthetic videos. We augment the real data automatically at training time, and our model does not involve any extra cost at test time, unlike Wang et al. (2018). Moreover, we do not assume real multi-view videos at training.

3D Human Shape Estimation. Recovering the full human body mesh from a single image has been explored as a model-fitting problem (Bogo et al. 2016; Lassner et al. 2017), as regressing model parameters with CNNs (Kanazawa et al. 2018; Omran et al. 2018; Pavlakos et al. 2018; Tung et al. 2017), and as regressing non-parametric representations such as graphs or volumes (Kolotouros et al. 2019; Varol et al. 2018). Recently, CNN-based parameter regression approaches have been extended to video (Kanazawa et al. 2019; Liu et al. 2019a; Kocabas et al. 2020). HMMR (Kanazawa et al. 2019) builds on the single-image-based HMR (Kanazawa et al. 2018) to learn the human dynamics by using 1D temporal convolutions. More recently, VIBE (Kocabas et al. 2020) adopts a recurrent model based on frame-level pose estimates provided by SPIN (Kolotouros et al. 2019). VIBE also incorporates an adversarial loss that penalizes the estimated pose sequence if it is not a ‘realistic’ motion, i.e., indistinguishable from the real AMASS (Mahmood et al. 2019) MoCap sequences. In this work, we recover 3D body parameters from real videos using HMMR (Kanazawa et al. 2019) and VIBE (Kocabas et al. 2020). Both methods employ the SMPL body model (Loper et al. 2015). We provide a comparison between the two methods for our purpose of action recognition, which can serve as a proxy task to evaluate motion estimation.

3 Synthetic Humans with Action Labels

Our goal is to improve the performance of action recognition using synthetic data in cases where the real data is limited, e.g., a viewpoint mismatch between training and test data, or a low-data regime. In the following, we describe the three stages of: (1) obtaining 3D temporal models for human actions from real training sequences (at a particular viewpoint) (Sect. 3.1); (2) using these 3D temporal models to generate training sequences for new (and the original) viewpoints using a rendering pipeline with augmentation (Sect. 3.2); and (3) training a spatio-temporal CNN with both real and synthetic data (Sect. 3.3).

3.1 3D Human Motion Estimation

In order to generate a synthetic video with graphics techniques, we need a sequence of articulated 3D human body models. We employ the parametric body model SMPL (Loper et al. 2015), a statistical model learned over thousands of 3D scans. SMPL generates the mesh of a person given disentangled pose and shape parameters. The pose parameters (\(\mathbb {R}^{72}\)) control the kinematic deformations due to skeletal posture, while the shape parameters (\(\mathbb {R}^{10}\)) control identity-specific deformations such as the person's height.
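To make this parameterization concrete, the snippet below generates a body mesh from a pose vector in \(\mathbb {R}^{72}\) (a global orientation plus 23 joint rotations in axis-angle form) and a shape vector in \(\mathbb {R}^{10}\). It is a minimal sketch using the publicly available smplx Python package, assuming the SMPL model files are stored under ./models; it is not necessarily the toolchain used in our rendering pipeline.

```python
import torch
import smplx

# Assumes the SMPL model files have been downloaded to ./models
model = smplx.create(model_path="./models", model_type="smpl", gender="neutral")

betas = torch.zeros(1, 10)          # shape parameters (identity-specific)
global_orient = torch.zeros(1, 3)   # root rotation: first 3 of the 72 pose values
body_pose = torch.zeros(1, 69)      # remaining 23 joints x 3 axis-angle values

output = model(betas=betas, global_orient=global_orient, body_pose=body_pose)
vertices = output.vertices          # (1, 6890, 3) mesh vertices to be rendered
joints = output.joints              # corresponding 3D joint locations
```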

We hypothesize that a human action can be captured by the sequence of pose parameters, and that the shape parameters are largely irrelevant (note that this may not necessarily hold for human-object interaction categories). Given reliable 3D pose sequences from action recognition video datasets, we can transfer the associated action labels to synthetic videos. We use the recent human mesh and motion recovery (HMMR) method of Kanazawa et al. (2019), unless stated otherwise. HMMR extends the single-image reconstruction method HMR (Kanazawa et al. 2018) to video with a multi-frame CNN that takes into account a temporal neighborhood around a video frame. HMMR learns a temporal representation for human dynamics by incorporating large-scale 2D pseudo-ground-truth poses for in-the-wild videos. It uses PoseFlow (Zhang et al. 2018) and AlphaPose (Fang et al. 2017) for multi-person 2D pose estimation and tracking as a pre-processing step. Each person crop is then given as input to the CNN for estimating the pose and shape, as well as the weak-perspective camera parameters. We refer the reader to Kanazawa et al. (2019) for more details. We choose this method for its robustness on in-the-wild videos, its ability to capture multiple people, and the smoothness of the recovered motion, all of which are important for generalization from synthetic videos to real ones. Figure 1 presents the 3D pose animated synthetically for sample video frames. We also experiment with the more recent motion estimation method VIBE (Kocabas et al. 2020), and show that improvements in motion estimation proportionally affect the action recognition performance in our pipeline. Note that we only use the pose parameters from HMMR or VIBE, and randomly change the shape parameters, camera parameters, and other factors. Next, we present the augmentations in our synthetic data generation.

3.2 SURREACT Dataset Components

In this section, we give details on our synthetic dataset, SURREACT (Synthetic hUmans foR REal ACTions).

We follow Varol et al. (2017) and render 3D SMPL sequences with randomized cloth textures, lighting, and body shapes. We animate the body model with our automatically extracted pose dynamics as described in the previous section. We explore various motion augmentation techniques to increase intra-class diversity in our training videos. We incorporate multi-person videos, which are especially important for two-person interaction categories. We also systematically sample from 8 viewpoints around a circle to perform controlled experiments. Different augmentations are illustrated in Fig. 2 for a sample synthetic frame. Visualizations from SURREACT are further provided in Fig. 3.
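To make the randomization concrete, the sketch below samples one set of generation parameters per rendered clip. The parameter names, asset file names, and value ranges are illustrative assumptions, not the exact configuration of our rendering pipeline.

```python
import random

AZIMUTHS = list(range(0, 360, 45))  # 8 systematic viewpoints: 0, 45, ..., 315 degrees

def sample_generation_params():
    """Sample randomized appearance and scene parameters for one synthetic clip."""
    return {
        "cloth_texture": random.choice(["tshirt_01.jpg", "suit_03.jpg", "dress_07.jpg"]),
        "body_shape": [random.gauss(0.0, 1.0) for _ in range(10)],  # SMPL betas
        "light_intensity": random.uniform(0.5, 1.5),
        "background": random.choice(["ntu_bg_001.png", "ntu_bg_002.png"]),
    }

# Each motion sequence is rendered once per viewpoint with freshly sampled parameters
renders = [{"azimuth": a, **sample_generation_params()} for a in AZIMUTHS]
```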

Each generated video has automatic ground truth for 3D joint locations, part segmentation, optical flow, and SMPL body (Loper et al. 2015) parameters, as well as an action label, which we use for training a video-based 3D CNN for action classification. We use the other ground truth modalities as input to action recognition in oracle experiments (see Table 14). We further use the optical flow ground truth to train a flow estimator, and use the segmentation to randomly augment the background pixels in some experiments.

Our new SURREACT dataset differs from the SURREAL dataset (Varol et al. 2017) mainly by providing action labels, exploring motion augmentation, and using automatically extracted motion sequences instead of MoCap recordings (CMU Mocap Database). Moreover, Varol et al. (2017) do not exploit the temporal aspect of their dataset, but only train CNNs with single-image input. We further employ multi-person videos and a systematic viewpoint distribution.

Fig. 1 Synthetic humans for actions: We estimate 3D shape from real videos and automatically render synthetic videos with action labels. We explore various augmentations for motions, viewpoints, and appearance. Training temporal CNNs with this data significantly improves the action recognition from unseen viewpoints

Fig. 2 Augmentations: We illustrate different augmentations of the SURREACT dataset for the hand waving action. We modify the joint angles with additive noise on the pose parameters for motion augmentation. We systematically change the camera position to create viewpoint diversity. We sample from a large set of body shape parameters, backgrounds, and clothing to randomize appearances

Fig. 3 SURREACT: We visualize samples from SURREACT for the actions from the NTU (left) and the UESTC (right) datasets. The motions are estimated using HMMR. Each real video frame is accompanied with three synthetic augmentations. On the left, we show the variations in clothes, body shapes, backgrounds, camera height/distance from the original 0\(^\circ \) viewpoint. On the right, we show the variations in viewpoints for 0\(^\circ \), 45\(^\circ \), and 90\(^\circ \) views. The complete list of actions can be found as a video at the project page (SURREACT project page)

Motion Augmentation. Automatic extraction of 3D sequences from 2D videos poses an additional challenge in our dataset compared to clean high-quality MoCap sequences. To reduce the jitter, we temporally smooth the estimated SMPL pose parameters by weighted linear averaging. SMPL poses are represented as axis-angle rotations between joints. We convert them into quaternions when we apply linear operations, then normalize each quaternion to have a unit norm, before converting back to axis-angles. Even with this processing, the motions may remain noisy, which is inevitable given that monocular 3D motion estimation is a difficult task on its own. Our findings interestingly suggest that the synthetic human videos are still beneficial when the motions are noisy.
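The smoothing step can be sketched as follows: per-joint axis-angle rotations are converted to quaternions, averaged linearly over a small weighted window, re-normalized to unit norm, and converted back to axis-angles. The window size and triangular weights are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def smooth_pose_sequence(axis_angles, window=5):
    """Temporally smooth SMPL pose parameters (T, 24, 3) with a weighted moving average."""
    T, J, _ = axis_angles.shape
    quats = R.from_rotvec(axis_angles.reshape(-1, 3)).as_quat().reshape(T, J, 4)

    # Enforce a consistent hemisphere so that linear averaging is meaningful
    for t in range(1, T):
        flip = (quats[t] * quats[t - 1]).sum(-1) < 0
        quats[t, flip] *= -1

    # Triangular weights, e.g. [1, 2, 3, 2, 1] for an odd window of 5
    half = window // 2
    weights = np.concatenate([np.arange(1, half + 2), np.arange(half, 0, -1)]).astype(float)

    smoothed = np.zeros_like(quats)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        w = weights[half - (t - lo): half + (hi - t)]
        avg = (quats[lo:hi] * w[:, None, None]).sum(0) / w.sum()
        smoothed[t] = avg / np.linalg.norm(avg, axis=-1, keepdims=True)  # back to unit norm

    return R.from_quat(smoothed.reshape(-1, 4)).as_rotvec().reshape(T, J, 3)
```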

To increase motion diversity, we further perturb the pose parameters with various augmentations. Specifically, we use video-level additive noise on the quaternions of each body joint to slightly change the poses, as an intra-individual augmentation. We also experiment with an inter-individual augmentation by interpolating between motion sequences of the same action class. Given a pair of sequences from two individuals, we first align them with dynamic time warping (Sakoe and Chiba 1978), then we linearly interpolate the quaternions of the time-aligned sequences to generate a new sequence, which we refer to as interpolation. A visual explanation of this process is provided in the Appendix. We show significant gains by increasing motion diversity.
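Both augmentations can be sketched as below. The noise magnitude is an assumed value, and the interpolation sketch assumes the two sequences have already been time-aligned with DTW and share consistent quaternion hemispheres.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def augment_additive_noise(axis_angles, std=0.1, rng=None):
    """Intra-individual augmentation: one video-level noise offset per joint.

    axis_angles: (T, 24, 3). The same offset is added to the quaternions of every
    frame, so the whole sequence is perturbed consistently.
    """
    rng = rng if rng is not None else np.random.default_rng()
    T, J, _ = axis_angles.shape
    quats = R.from_rotvec(axis_angles.reshape(-1, 3)).as_quat().reshape(T, J, 4)
    quats = quats + rng.normal(0.0, std, size=(1, J, 4))       # broadcast over time
    quats /= np.linalg.norm(quats, axis=-1, keepdims=True)
    return R.from_quat(quats.reshape(-1, 4)).as_rotvec().reshape(T, J, 3)

def augment_interpolation(quats_a, quats_b, alpha=0.5):
    """Inter-individual augmentation: blend two time-aligned quaternion sequences (T, J, 4)."""
    blended = (1.0 - alpha) * quats_a + alpha * quats_b
    return blended / np.linalg.norm(blended, axis=-1, keepdims=True)
```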

Multi-person. We use the 2D pose information from (Fang et al. 2017; Zhang et al. 2018) to count the number of people in the real video. In the case of a single person, we center the person in the image and do not add a 3D translation to the body, i.e., the person is centered independently for each frame. While such constant global positioning of the body loses information for some actions such as walking and jumping, we find that the translation estimates are too noisy to be worth including and would potentially increase the domain gap with the real data, where no such noise exists (see Appendix A). If there is more than one person, we insert additional body model(s) for rendering. We translate each person in the xy image plane. Note that we do not translate the person in the full xyz space: the z component of the translation estimate is not reliable due to depth ambiguity, therefore the people are always centered at \(z=0\). More explanations about the reason for omitting the z component can be found in Appendix A. We temporally smooth the translations to reduce the noise. We subtract the mean of the translations across the video and across the people to roughly center all people in the frame. We therefore keep the relative distances between people, which is important for actions such as walking towards each other.
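A simplified sketch of this centering step is given below, assuming per-person, per-frame translation estimates are available; the smoothing window is an assumed value.

```python
import numpy as np

def center_translations(trans, smooth_window=5):
    """Center all people in the frame while keeping their relative distances.

    trans: (P, T, 3) per-person, per-frame translation estimates. The z component
    is discarded (depth is unreliable), the xy components are temporally smoothed,
    and the mean over all people and frames is subtracted.
    """
    xy = trans[..., :2].copy()

    # Temporal smoothing with a simple moving average
    kernel = np.ones(smooth_window) / smooth_window
    for p in range(xy.shape[0]):
        for c in range(2):
            xy[p, :, c] = np.convolve(xy[p, :, c], kernel, mode="same")

    # Subtract the mean over the whole video and over all people
    xy -= xy.mean(axis=(0, 1), keepdims=True)

    out = np.zeros_like(trans)   # everyone stays at z = 0
    out[..., :2] = xy
    return out
```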

Viewpoints. We systematically render each motion sequence 8 times, randomizing all other generation parameters at each view. In particular, we place the camera at {\(0^{\circ }, 45^{\circ }, 90^{\circ }, 135^{\circ }, 180^{\circ }, 225^{\circ }, 270^{\circ },\) \(315^{\circ }\)} azimuth angles with respect to the origin, denoted as (\(0^\circ \):\(45^\circ \):\(360^\circ \)) in our experiments. The distance of the camera from the origin and the height of the camera from the ground are randomly sampled from predefined ranges: [4, 6] meters for the distance and \([-1, 3]\) meters for the height. These ranges can be adjusted according to the target test setting.
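The camera placement can be sketched as follows, assuming the body is at the origin and the z axis points up (the coordinate convention is an assumption); the azimuths, distance range, and height range follow the values above.

```python
import math
import random

def sample_camera(azimuth_deg, dist_range=(4.0, 6.0), height_range=(-1.0, 3.0)):
    """Return a camera position (x, y, z) for a given azimuth with random distance/height."""
    dist = random.uniform(*dist_range)
    height = random.uniform(*height_range)
    az = math.radians(azimuth_deg)
    return (dist * math.cos(az), dist * math.sin(az), height)

# Systematically render each sequence from 8 viewpoints: 0, 45, ..., 315 degrees
cameras = [sample_camera(a) for a in range(0, 360, 45)]
```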

Backgrounds. Since we have access to the target real dataset where we run pose estimation methods, we can extract background pixels directly from the training set of this dataset. We crop from regions without the person to obtain static backgrounds for the NTU and UESTC datasets. We experimentally show the benefits of using the target dataset backgrounds in the Appendix (see Table 15). For Kinetics experiments, we render human bodies on top of unconstrained videos from non-overlapping action classes and show benefits over static backgrounds. Note that these background videos might also include human pixels.

3.3 Training 3D CNNs with Non-Uniform Frames

Following the success of 3D CNNs for video recognition (Carreira and Zisserman 2017; Hara et al. 2018; Tran et al. 2015), we employ a spatio-temporal convolutional architecture that operates on multi-frame video inputs. Unless otherwise specified, our network architecture is 3D ResNet-50 (Hara et al. 2018) and its weights are randomly initialized (see Appendix B.4 for pretraining experiments).

To study the generalization capability of synthetic data across different input modalities, we train one CNN for RGB and another for optical flow as in Simonyan and Zisserman (2014). We average the scores with equal weights when reporting the fusion.

We subsample fixed-sized inputs from videos with a \(16\times 256\times 256\) spatio-temporal resolution, in terms of number of frames, width, and height, respectively. In the case of optical flow input, we map the RGB input to \(15\times 64\times 64\) dimensional flow estimates. To estimate flow, we train a two-stack hourglass architecture (Newell et al. 2016) on our synthetic data, estimating flow from 2 consecutive frames. We refer the reader to Figure 10 for qualitative results of our optical flow estimation.

Non-Uniform Frame Sampling. We adopt a different frame sampling strategy than most works (Carreira and Zisserman 2017; Feichtenhofer et al. 2019; Hara et al. 2018) in the context of 3D CNNs. Instead of uniformly sampling a video clip of consecutive frames (at a fixed frame rate), we randomly sample frames across time while keeping their temporal order, which we refer to as non-uniform sampling. Although recent works explore multiple temporal resolutions, e.g. by regularly sampling at two different frame rates (Feichtenhofer et al. 2019), or randomly selecting a frame rate (Zhu and Newsam 2018), the sampled frames are equidistant from each other. TSN (Wang et al. 2016) and ECO (Zolfaghari et al. 2018) employ a hybrid strategy that regularly samples temporal segments and randomly samples one frame from each segment, which is a more restricted special case of our strategy. Moreover, TSN uses a 2D CNN without temporal modelling. ECO (Zolfaghari et al. 2018) also computes 2D convolutional features on each frame, which are stacked as input to a 3D CNN only at the end of the network. None of these works provide controlled experiments to quantify the effect of their sampling strategy. The concurrent work of Chen et al. (2020) presents an experimental analysis comparing dense consecutive sampling with the hybrid sampling of TSN.

Figure 4 compares consecutive sampling with our non-uniform sampling; see also the sketch below. In our experiments, we report results for both and show improvements for the latter. Our videos are temporally trimmed around the action; therefore, each video is short, i.e. it spans several seconds. During training we randomly sample 16 video frames as a fixed-sized input to the 3D CNN. Thus, the convolutional kernels become speed-invariant to some degree. This can be seen as a data augmentation technique, as well as a way to capture long-term cues.
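A minimal sketch of the two training-time sampling schemes, assuming a 16-frame clip length:

```python
import numpy as np

def sample_clip_indices(num_video_frames, clip_len=16, uniform=False, rng=None):
    """Select frame indices for one training clip.

    uniform: consecutive frames from a random start (fixed frame rate).
    non-uniform (default): a random, temporally ordered subset of the whole video,
    which augments playback speed and captures longer-term context.
    """
    rng = rng if rng is not None else np.random.default_rng()
    if uniform:
        start = rng.integers(0, max(1, num_video_frames - clip_len + 1))
        return np.clip(np.arange(start, start + clip_len), 0, num_video_frames - 1)
    # Sample without replacement (unless the video is too short), then keep temporal order
    replace = num_video_frames < clip_len
    return np.sort(rng.choice(num_video_frames, size=clip_len, replace=replace))
```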

Fig. 4 Frame sampling: We illustrate our non-uniform frame sampling strategy in our 3D CNN training. Compared to the commonly adopted consecutive setting which uniformly samples with a fixed frame rate, non-uniform sampling has random skips in time, allowing speed augmentations and long-term context

Fig. 5 Datasets: We show sample video frames from the multi-view datasets used in our experiments. NTU and UESTC datasets have 3 and 8 viewpoints, respectively. NTU views correspond to \(0^\circ \), \(45^\circ \), and \(90^\circ \) from left to right. UESTC covers \(360^\circ \) around the performer

At test time, we sample several 16-frame clips and average the softmax scores. In the uniform case, we sample non-overlapping consecutive clips with a sliding window. In the non-uniform case, we randomly sample as many clips as the number of sliding windows in the uniform case; in other words, the number of sampled clips is proportional to the video length. More precisely, let T be the number of frames in the entire test video, F the number of input frames per clip, and S the stride parameter. We sample \(N = \left\lceil (T - F) / S \right\rceil + 1\) clips; in our case \(F=16\), \(S=16\). For the non-uniform case, each clip is an ordered random (without replacement) 16-frame subset of the T frames. We observe that it is important to train and test with the same sampling scheme, and that keeping the temporal order is important. More details can be found in Appendix B.5.
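The test-time clip selection can be sketched as follows, directly implementing \(N = \lceil (T - F) / S \rceil + 1\):

```python
import math
import numpy as np

def test_clips(T, F=16, S=16, uniform=True, rng=None):
    """Return N frame-index arrays whose softmax scores are averaged at test time."""
    rng = rng if rng is not None else np.random.default_rng()
    N = max(1, math.ceil((T - F) / S) + 1)
    if uniform:
        # Non-overlapping sliding windows of consecutive frames
        return [np.clip(np.arange(i * S, i * S + F), 0, T - 1) for i in range(N)]
    # Same number of clips; each is an ordered random subset of all T frames
    return [np.sort(rng.choice(T, size=F, replace=T < F)) for _ in range(N)]
```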

Synth+Real. Since each real video is augmented multiple times (e.g. 8 times for 8 views), we have more synthetic data than real. When we add synthetic data to training, we balance the real and synthetic datasets such that at each epoch we randomly subsample from the synthetic videos to obtain an equal number of real and synthetic samples.
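A minimal sketch of this per-epoch balancing; the dataset bookkeeping with (source, index) pairs is a hypothetical convention for illustration.

```python
import random

def build_epoch_indices(num_real, num_synth, rng=None):
    """All real videos plus an equally sized random subset of the (larger) synthetic set,
    re-drawn at every epoch."""
    rng = rng if rng is not None else random.Random()
    synth_subset = rng.sample(range(num_synth), k=min(num_real, num_synth))
    epoch = [("real", i) for i in range(num_real)] + [("synth", i) for i in synth_subset]
    rng.shuffle(epoch)
    return epoch
```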

We minimize the cross-entropy loss using RMSprop (Tieleman and Hinton 2012) with mini-batches of size 10 and an initial learning rate of \(10^{-3}\) with a fixed schedule. Color augmentation is used for the RGB stream. Other implementation details are given in Appendix A.

Table 1 Training jointly on synthetic and real data substantially boosts the performance compared to only real training on NTU CVS protocol, especially on unseen views (\(45^\circ \), \(90^\circ \)) (e.g., 69.0% vs 53.6%). The improvement can be seen for both RGB and Flow streams, as well as the fusion. We note the marginal improvements with the addition of flow unlike in other tasks where flow has been used to reduce the synthetic-real domain gap (Doersch and Zisserman 2019). We render two different versions of the synthetic dataset using HMMR and VIBE motion estimation methods, and observe improvements with VIBE. Moreover, training on synthetic videos alone is able to obtain 63.0% accuracy
Table 2 Real baselines: Training and testing with our cross-view-subject (CVS) protocol of the NTU dataset using only real RGB videos. Rows and columns correspond to training and testing sets, respectively. Training and testing on the same viewpoint shows the best performance, as can be seen by the diagonals of the first three rows. This shows the domain gap present between the \(0^\circ \), \(45^\circ \), \(90^\circ \) viewpoints. Adding more viewpoints to the training (last two rows) compensates for the domain gap. Non-uniform frame sampling (right) consistently outperforms uniform frame sampling (left)

4 Experiments

In this section, we start by presenting the action recognition datasets used in our experiments (Sect. 4.1). Next, we present extensive ablations for action recognition from unseen viewpoints (Sect. 4.2). Then, we compare our results to the state of the art for completeness (Sect. 4.3). Finally, we illustrate our approach on in-the-wild videos (Sect. 4.4).

4.1 Datasets and Evaluation Protocols

We briefly present the datasets used in this work, as well as the evaluation protocols employed.

Fig. 6 Inputting raw motion parameters performs significantly worse for the \(90^\circ \) unseen viewpoint compared to synthetic renderings on the NTU CVS protocol. We compare various input representations with increasing view-independence (joint coordinates, SMPL pose parameters, SMPL pose parameters without the global rotation). Experiments are carried out with the SMPL model recovered with the RGB-based methods HMMR (Kanazawa et al. 2019) and VIBE (Kocabas et al. 2020), and with depth-based Kinect joints. A 2D ResNet architecture is used for motion parameter inputs, similar to Ke et al. (2017). We also present an architecture study in Table 3. Note that significant gains are further possible when mixing the synthetic renderings with real videos. See text for interpretation

NTU RGB+D Dataset (NTU). This dataset (Shahroudy et al. 2016) captures 60 actions with 3 synchronous cameras (see Fig. 5). The large scale (56K videos) of the dataset allows training deep neural networks. Each sequence has 84 frames on average. The standard protocols (Shahroudy et al. 2016) report accuracy for cross-view and cross-subject splits. The cross-view (CV) split considers the 0\(^\circ \) and 90\(^\circ \) views as training and the 45\(^\circ \) view as test, and the same subjects appear both in training and test. For the cross-subject (CS) setting, 20 subjects are used for training, the remaining 20 for test, and all 3 views are seen at both training and test. We report on the standard protocols to be able to compare to the state of the art (see Table 8). However, we introduce a new protocol to make the task more challenging. From the cross-subject training split that has all 3 views, we take only the 0\(^\circ \) viewpoint for training, and we test on the 0\(^\circ \), 45\(^\circ \), 90\(^\circ \) views of the cross-subject test split. We call this protocol cross-view-subject (CVS). Our focus is mainly to improve for the unseen and distinct view of 90\(^\circ \).

UESTC RGB-D Varying-view 3D Action Dataset (UESTC). UESTC is a recent dataset (Ji et al. 2018) that systematically collects 8 equally separated viewpoints covering 360\(^\circ \) around a person (see Fig. 5). In total, the dataset has 118 subjects, 40 action categories, and 26500 videos of more than 200 frames each. This dataset allows studying actions from unusual views, such as from behind the person. We use the official Cross View I (CV-I) protocol, suitable for our task, which trains with 1 viewpoint and tests with all other 7, for each view. The final performance is evaluated as the average across all tests. For completeness, we also report the Cross View II (CV-II) protocol that concentrates on multi-view training, i.e., training with even viewpoints (FV, V2, V4, V6) and testing with odd viewpoints (V1, V3, V5, V7), and vice versa.

One-shot Kinetics-15 Dataset (Kinetics-15). Since we wish to formulate a one-shot scenario from in-the-wild Kinetics (Kay et al. 2017) videos, we need a pre-trained model to serve as feature extractor. We use a model pre-trained on Mini-Kinetics-200 (Xie et al. 2017), a subset of Kinetics-400. We define the novel classes from the remaining 200 categories which can be described by body motions. This procedure resulted in a 15-class subset of Kinetics-400: bending back, clapping, climbing a rope, exercising arm, hugging, jogging, jumpstyle dancing, krumping, push up, shaking hands, skipping rope, stretching arm, swinging legs, sweeping floor, wrestling. Note that many of the remaining categories, such as waiting in line, dining, and holding snake, cannot be recognized solely from body motions, but require additional contextual cues. From the 15 actions, we randomly sample 1 training video per class (see Fig. 8 for example videos with their synthetic augmentations). The training set therefore consists of 15 videos. For testing, we report accuracy on all 725 validation videos from these 15 classes. A limitation of this protocol is its sensitivity to the choice of the 15 training videos; e.g., if 3D motion estimation fails on one video, the model will not benefit from additional synthetic data for that class. Future work can consider multiple possible training sets (e.g., sampling videos where 3D pose estimation is confident) and report average performance.

4.2 Ablation Study

We first compare real-only (Real), synthetic-only (Synth), and mixed synthetic and real (Synth+Real) training. Next, we explore the effect of the motion estimation quality and inputting raw motion parameters as opposed to synthetic renderings. Then, we experiment with the different synthetic data generation parameters to analyze the effects of viewpoint and motion diversity. In all cases, we evaluate our models on real test videos.

Real Baselines. We start with our cross-view-subject protocol on NTU by training only with real data. Table 2 summarizes the results of training the model on a single view and testing on all views. We observe a clear domain gap between different viewpoints, which can be naturally reduced by adding more views at training. However, when only a single view is available, this is not possible. If we train only with \(0^\circ \), the performance is high (83.9%) when tested on \(0^\circ \), but drops significantly (42.9%) when tested on \(90^\circ \). In the remainder of our experiments on NTU, we assume that only the frontal viewpoint (\(0^\circ \)) is available.

Non-Uniform Frame Sampling. We note the consistent improvement of non-uniform frame sampling over the uniform consecutive sampling in all settings in Table 2. Additional experiments about video frame sampling, such as the optical flow stream, can be found in Appendix B.5. We use our non-uniform sampling strategy for both RGB and flow streams in the remainder of experiments unless specified otherwise.

Synth+Real Training. Next, we report the improvements obtained by synthetically increasing view diversity. We train on the 60 action classes of NTU by combining the real \(0^\circ \) training data and the synthetic data augmented from it with 8 viewpoints, i.e. \(0^\circ \):\(45^\circ \):\(360^\circ \). Table 1 compares the results of Real, Synth, and Synth+Real training for the RGB and Flow streams, as well as their combination. The performance of the flow stream is generally lower than that of the RGB stream, possibly due to fine-grained categories which cannot be distinguished from coarse motion fields.

It is interesting to note that training only with synthetic data (Synth) reaches 63.0% accuracy on real \(0^\circ \) test data, which indicates a certain level of generalization capability from synthetic to real. Combining real and synthetic training videos (Synth+Real) increases the performance of the RGB stream from 53.6% to 69.0% on the challenging unseen \(90^\circ \) viewpoint, compared to training on real data only (Real). Note that the additional synthetic videos can be obtained ‘for free’, i.e. without extra annotation cost. We also confirm that even noisy motion estimates are sufficient to obtain significant improvements, suggesting that the discriminative action information is still present in our synthetic data.

Fig. 7 Amount of data: The number of real sequences per action for: Real, Synth, Synth+Real training on NTU CVS split. Generalization to unseen viewpoints is significantly improved with the addition of synthetic data (green) compared to training only with real (pink). Real training contains the 0\(^\circ \) view. We experiment with all 8 views (green) or only the 0\(^\circ \) view (yellow) in the additional synthetic data. See text for interpretation

The advantage of having a controllable data generation procedure is to be able to analyze what components of the synthetic data are important. In the following, we examine a few of these aspects, such as quality of the motion estimation, input representation, amount of data, view diversity, and motion diversity. Additional results can be found in Appendix B.

Quality of the Motion Estimation: HMMR vs VIBE. 3D motion estimation from monocular videos has only recently demonstrated convincing performance on unconstrained videos, opening up the possibility to investigate our problem of action recognition with synthetic videos. One natural question is whether the progress in 3D motion estimation methods will improve the synthetic data. To this end, we compare two sets of synthetic data, keeping all the factors the same except the motion source: Synth\(_{HMMR}\) extracted with HMMR (Kanazawa et al. 2019), and Synth\(_{VIBE}\) extracted with VIBE (Kocabas et al. 2020). Table 1 presents the results. We observe consistent improvements with more accurate pose estimation from VIBE over HMMR, suggesting that our proposed pipeline has great potential to further improve with the progress in 3D recovery.

Raw Motion Parameters as Input. Another question is whether the motion estimation output, i.e., the body pose parameters, can be directly used as input to an action recognition model instead of going through synthetic renderings. We implement a simple 2D CNN architecture similar to Ke et al. (2017) that takes as input a 16-frame pose sequence in the form of 3D joint coordinates (24 joints for SMPL, 25 joints for Kinect) or 3D joint rotations (24 axis-angle parent-relative rotations for SMPL, or 23 without the global rotation). In particular, we use a ResNet-18 architecture (He et al. 2015). We experiment with both HMMR and VIBE to obtain the SMPL parameters as input, as well as with the Kinect joints provided by the NTU dataset for comparison. Figure 6 reports the results of various pose representations against the performance of synthetic renderings for the three test views. We make several observations: (1) removing viewpoint-dependent factors, e.g., using pose parameters instead of joint coordinates, degrades performance on the seen viewpoint, but consistently improves on unseen viewpoints; (2) synthetic video renderings from all viewpoints significantly improve over raw motion parameters for the challenging unseen viewpoint; (3) VIBE outperforms HMMR; (4) both RGB-based motion estimation methods are competitive with the depth-based Kinect joints.
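As an illustration of the most view-independent variant, the sketch below drops the global root rotation from the SMPL parameters before they are fed to a pose-based model:

```python
import numpy as np

def pose_input(smpl_pose, remove_global=True):
    """Build a pose-based network input from per-frame SMPL parameters (T, 72).

    Dropping the first joint (the global root rotation) removes the main
    viewpoint-dependent factor, leaving 23 parent-relative rotations per frame.
    """
    pose = smpl_pose.reshape(-1, 24, 3)
    if remove_global:
        pose = pose[:, 1:, :]                     # (T, 23, 3)
    return pose.reshape(pose.shape[0], -1)        # flatten per frame
```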

Table 3 Architecture comparison: We explore the influence of architectural improvements for pose-based action recognition models: a 2D ResNet with temporal convolutions versus ST-GCN with graph convolutions on the SMPL pose parameters obtained by VIBE. While ST-GCN improves over the 2D ResNet, the performance of synthetic-only training with renderings remains superior for the unseen \(90^\circ \) viewpoint

We note the significant boost with renderings (45.3%) over pose parameters (29.0%) for the \(90^\circ \) test view despite the same source of motion information for both. There are three main differences which can be potential reasons. First, the architectures 3D ResNet and 2D ResNet have different capacities. Second, motion estimation from non-frontal viewpoints can be challenging, negatively affecting the performance of pose-based methods, but not affecting 3D ResNet (because pose estimation is not a required step). Third, the renderings have the advantage that standard data augmentation techniques on image pixels can be applied, unlike the pose parameters which are not augmented. More importantly, the renderings have the advantage that they can be mixed with the real videos, which showed to substantially improve the performance in Table 1.

To explore the architectural capacity question, we study the pose-based action recognition model further and experiment with the recent ST-GCN model (Yan et al. 2018) that makes use of graph convolutions. For this experiment, we use VIBE pose estimates and compare ST-GCN with the 2D ResNet architecture in Table 3. Although we observe improvements with using ST-GCN (29.0% vs 36.2%), the synthetic renderings provide significantly better generalization to the unseen 90\(^\circ \) view (45.3%).

Amount of Data. In the NTU CVS training split, we have about 220 sequences per action. We take subsets with {10, 30, 60, 100} sequences per action, and train the three scenarios Real, Synth, and Synth+Real for each subset. Figure 7 plots the performance versus the amount of data for these scenarios, for both RGB and Flow streams. We observe a consistent improvement from complementary synthetic training, especially for unseen viewpoints. We also see that, for a given number of sequences per action, it is more effective to add synthetic data than to collect more real data. For example, on the \(90^\circ \) viewpoint, increasing the number of real sequences from 100 to 220 results in only a 4.6% improvement (49.0% vs 53.6%, Real), while one can synthetically augment the existing 100 sequences per action and obtain 64.7% (Synth+Real) accuracy without spending extra annotation effort.

View Diversity. We wish to confirm that the improvements presented so far are mainly due to the viewpoint variation in the synthetic data. The “Synth(v=\(0^\circ \)) + Real” plot in Fig. 7 corresponds to the setting where only the \(0^\circ \) viewpoint of the synthetic data is used. In this case, we observe that the improvement is not consistent; it is therefore important to augment viewpoints to obtain improvements. Moreover, we experiment with having only the \(\pm 45^\circ \) or \(\pm 90^\circ \) views in synthetic-only training with 60 sequences per action. In Table 4, we observe that the test performance is higher when the synthetic training view matches the real test view. However, having all 8 viewpoints at training benefits all test views.

Table 4 Viewpoint diversity: The effect of the views in the synthetic training on the NTU CVS split. We train only with synthetic videos obtained from real data of 60 sequences per action. We take a subset of views from the synthetic data: 0\(^\circ \), ±45\(^\circ \), ±90\(^\circ \). Even when synthetic, the performance is better when the viewpoints match between training and test. The best performance is obtained with all 8 viewpoints combined
Table 5 Motion diversity: We study the effect of motion diversity in the synthetic training on a subset of the NTU CVS split. The results indicate that clothing and body shape diversity is not as important as motion diversity (second and last rows). We can significantly improve the performance by motion augmentations, especially with a video-level additive noise on the joint rotations (second and sixth rows). Here, each dataset is rendered with all 8 views and the training is only performed on synthetic data. At each rendering, we randomly sample clothing, body shape, lighting, etc
Table 6 UESTC dataset Cross View I protocol: Training on 1 viewpoint and testing on all the others. The plots on the right show individual performances for the RGB networks. The rows and columns of the matrices correspond to training and testing views, respectively. We obtain significant improvements over the state of the art, due to our non-uniform frame sampling and synthetic training
Table 7 UESTC dataset Cross View II protocol: Training on 4 odd viewpoints, testing on 4 even viewpoints (left), and vice versa (right). We present the results on both splits and their average for the RGB and Flow streams, as well as the RGB+Flow late fusion. Real+Synth training consistently outperforms the Real baseline

Motion Diversity. Next, we investigate whether motions can be diversified and whether this is beneficial for synthetic training. There are very few attempts in this direction (De Souza et al. 2017), since synthetic data has mainly been used for static images. Recently, Liu et al. (2019a) introduced interpolation between distinct poses to create new poses in synthetic data for training 3D pose estimation; however, its contribution over existing poses was not experimentally validated. In our case, we need to preserve the action information; therefore, we cannot generate unconstrained motions. Generating realistic motions is a challenging research problem on its own and is out of the scope of this paper. Here, we experiment with motion augmentation to increase diversity.

As explained in Sect. 3.2, we generate new motion sequences by (1) interpolating between motion pairs of the same class, or by (2) adding noise to the pose parameters. Table 5 presents the results of this analysis when we train only with synthetic data and test on the NTU CVS protocol. We compare to the baseline where 10 motion sequences per action are rendered once per viewpoint (the first row). We render the same sequences without motion augmentation 6 times (the second row) and obtain a marginal improvement. On the other hand, having 60 real motion sequences per action significantly improves performance (last row) and is our upper bound for the motion augmentation experiments. This means that appearance diversity (clothing, body shape, lighting) is not as important as motion diversity. We see that generating new sequences with interpolations improves over the baseline. Moreover, perturbing the joint rotations across the video with additive noise is simple and effective, with a performance increase of about 5% (26.2% vs 31.5%) over rendering 6 times without motion augmentation. To justify the video-level noise (i.e., one noise value added to all frames), we also experiment in Table 5 with frame-level noise and a hybrid version where we independently sample a noise value every 25 frames and interpolate it for the frames in between. These renderings qualitatively remain very noisy, reducing the performance in return.

Table 8 State of the art comparison: We report on the standard protocols of NTU for completeness. We improve previous RGB-based methods (bottom) due to non-uniform sampling and synthetic training. Additional cues extracted from RGB modality are denoted in parenthesis. We perform on par with skeleton-based methods (top) without using the Kinect sensor

4.3 Comparison with the State of the Art

In the following, we employ the standard protocols for the UESTC and NTU datasets, and compare our performance with other works. Tables 6 and 7 compare our results to the state-of-the-art methods reported by Ji et al. (2018) on the recently released UESTC dataset, on the CV-I and CV-II protocols. To augment the UESTC dataset, we use the VIBE motion estimation method. We outperform the RGB-based methods JOULE (Hu et al. 2017) and 3D ResNeXt-101 (Hara et al. 2018) by a large margin even though we use a less deep 3D ResNet-50 architecture. We note that we have trained the ResNeXt-101 architecture (Hara et al. 2018) with our implementation and obtained better results than with our ResNet-50 architecture (45.2% vs 36.1% on CV-I, 82.5% vs 76.1% on CV-II), which contradicts the results reported in Ji et al. (2018). A first improvement can be attributed to our non-uniform frame sampling strategy; therefore, we report our uniform real baseline as well. A further significant performance boost is obtained by training on a mixture of synthetic and real data. Using only RGB input, we obtain a 17.0% improvement on the challenging CV-I protocol over training with real data only (66.4 vs 49.4). Using both RGB and flow, we obtain a 44.1% improvement over the state of the art (76.1 vs 32.0). We also report on the even/odd test splits of the CV-II protocol, which have access to multi-view training data. The synthetic data again shows benefits over the real baselines. Compared to NTU, which contains object interactions that we do not simulate, the UESTC dataset focuses more on anatomical movements, such as body exercises. We believe that these results convincingly demonstrate the generalization capability of our efficient synthetic data generation method to real body motion videos.

In Table 8, we compare our results to the state-of-the-art methods on the standard NTU splits. The synthetic videos are generated using the HMMR motion estimation method. Our results on both splits achieve state-of-the-art performance using only the RGB modality. In comparison, (Baradel et al. 2018; Luvizon et al. 2018; Zolfaghari et al. 2017) use pose information during training, and Luo et al. (2018) use other Kinect modalities, such as depth and skeleton, during training. Similar to us, Wang et al. (2018) use a two-stream approach. Our non-uniform sampling boosts the performance. We obtain moderate gains with the synthetic data for both RGB and flow streams, as the real training set is already large and similar to the test set.

Fig. 8 Sample video frames from the one-shot Kinetics-15 dataset. We provide side-by-side illustrations for real frames and their synthetically augmented versions from the original viewpoint. Note that we render the synthetic body on a static background for computational efficiency, but augment it during training with random real videos by using the segmentation mask

4.4 One-Shot Training

We test the limits of our approach on the unconstrained videos of the Kinetics-15 dataset. These videos are challenging for several reasons. First, 3D human motion estimation often fails due to complex conditions such as motion blur, low resolution, occlusion, crowded scenes, and fast motion. Second, there exist contextual action cues that are difficult to simulate, such as object interactions or biases towards certain clothing or environments for certain actions. Assuming that body motions alone, even when noisy, provide discriminative information for actions, we augment the 15 training videos of the one-shot Kinetics-15 subset synthetically using HMMR (see Fig. 8), rendering each at 5 viewpoints (0\(^\circ \), 30\(^\circ \), 45\(^\circ \), 315\(^\circ \), 330\(^\circ \)).

Table 9 One-shot Kinetics-15: Real training data consists of 1 training sample per category, i.e., 15 videos. Random chance and nearest neighbor rows present baseline performances for this setup. We augment each training video with 5 different viewpoints by synthetically rendering SMPL sequences extracted from real data (i.e., 75 videos), blended on random backgrounds from the Mini-Kinetics training videos and obtain 6.5% improvement over training only with real data. For the last 4 rows, we train only the last linear layer of the ResNeXt-101 3D CNN model pre-trained on Mini-Kinetics 200 classes

We use a pre-trained feature extractor model and only train a linear layer from the features to the 15 classes. We observe over-fitting with higher-capacity models due to the limited one-shot training data. We experiment with two pre-trained models, RGB and flow, obtained from Crasto et al. (2019). The models follow the 3D ResNeXt-101 architecture of Hara et al. (2018) and are pre-trained on the Mini-Kinetics-200 categories at \(16 \times 112 \times 112\) resolution with consecutive frame sampling.
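A minimal sketch of this linear-probe setup is given below; the feature dimension, optimizer, and learning rate are assumptions rather than the exact training configuration.

```python
import torch
import torch.nn as nn

feat_dim, num_classes = 2048, 15                 # assumed backbone feature dimension
classifier = nn.Linear(feat_dim, num_classes)    # the only trainable parameters
optimizer = torch.optim.RMSprop(classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(frozen_features, labels):
    """frozen_features: (B, feat_dim) clip features from the frozen 3D CNN backbone."""
    optimizer.zero_grad()
    loss = criterion(classifier(frozen_features), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```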

In Table 9 (top), we first provide simple baselines: nearest neighbor with pre-trained features is slightly above random chance (8.6% vs 6.7% for RGB). Table 9 (bottom) shows the results of training linear layers. Using only synthetic data obtains poor performance (9.4%). Training only with real data, on the other hand, obtains 26.2%, which is our baseline performance. We obtain a \(\sim \)6% improvement by adding synthetic data. We also experiment with static background images from the LSUN dataset (Yu et al. 2015) and note the importance of realistic noisy backgrounds for generalization to in-the-wild videos.

5 Conclusions

We presented an effective methodology for automatically augmenting action recognition datasets with synthetic videos. We explored the importance of different variations in the synthetic data, such as viewpoints and motions. Our analysis emphasizes the question of how to diversify motions within an action category. We obtain significant improvements for action recognition from unseen viewpoints and for one-shot training. However, our approach is limited by the performance of 3D pose estimation, which can fail in cluttered scenes. Possible future directions include action-conditioned generative models for motion sequences and the simulation of contextual cues for action recognition.