Synthetic Humans for Action Recognition from Unseen Viewpoints

Varol, Gül; Laptev, Ivan; Schmid, Cordelia; Zisserman, Andrew

doi:10.1007/s11263-021-01467-7

Open access
Published: 12 May 2021

Volume 129, pages 2264–2287, (2021)
International Journal of Computer Vision

Gül Varol ORCID: orcid.org/0000-0002-8438-6152¹,
Ivan Laptev²,
Cordelia Schmid² &
…
Andrew Zisserman³


Abstract

Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored. Our goal in this work is to answer the question whether synthetic humans can improve the performance of human action recognition, with a particular focus on generalization to unseen viewpoints. We make use of the recent advances in monocular 3D human body reconstruction from real action sequences to automatically render synthetic training videos for the action labels. We make the following contributions: (1) we investigate the extent of variations and augmentations that are beneficial to improving performance at new viewpoints. We consider changes in body shape and clothing for individuals, as well as more action relevant augmentations such as non-uniform frame sampling, and interpolating between the motion of individuals performing the same action; (2) We introduce a new data generation methodology, SURREACT, that allows training of spatio-temporal CNNs for action classification; (3) We substantially improve the state-of-the-art action recognition performance on the NTU RGB+D and UESTC standard human action multi-view benchmarks; Finally, (4) we extend the augmentation approach to in-the-wild videos from a subset of the Kinetics dataset to investigate the case when only one-shot training data is available, and demonstrate improvements in this case as well.

1 Introduction

Learning human action representations from RGB video data has been widely studied. Recent advances in convolutional neural networks (CNNs) (LeCun et al. 1989) have shown excellent performance (Carreira and Zisserman 2017; Feichtenhofer et al. 2019, 2016; Hara et al. 2018; Lin et al. 2019; Varol et al. 2018; Wang et al. 2016) on benchmark datasets, such as UCF101 (Soomro et al. 2012). However, the success of CNNs relies heavily on large-scale training data, which is not always available. To address the lack of training data, several works explore the use of complementary synthetic data for a range of tasks in computer vision such as optical flow estimation, segmentation, human body and hand pose estimation (Dosovitskiy et al. 2015; Shotton et al. 2011; Su et al. 2015; Varol et al. 2017; Zimmermann and Brox 2017). In this work, we ask how to synthesize videos for action recognition when real data is limited, such as when only one viewpoint or only one-shot data is available at training.

Imagine a surveillance or ambient assisted living system, where a dataset is already collected for a set of actions from a certain camera. Placing a new camera in the environment from a new viewpoint would require re-annotating data because the appearance of an action is drastically different when observed from different viewpoints (Junejo et al. 2011; Liu et al. 2011; Zheng et al. 2016). In fact, we observe that state-of-the-art action recognition networks fail drastically when trained and tested on distinct viewpoints. Specifically, we train the model of Hara et al. (2018) on videos from the NTU RGB+D benchmark dataset (Shahroudy et al. 2016) where people are facing the camera. When we test this network on other front-view (0\(^\circ \)) videos, we obtain \(\sim \)80% accuracy. When we test with side-view (90\(^\circ \)) videos, the performance drops to \(\sim \)40% (see Sect. 4). This motivates us to study action recognition from novel viewpoints.

Existing methods addressing cross-view action recognition do not work in challenging setups (e.g. same subjects and similar viewpoints in training and test splits (Shahroudy et al. 2016)). We introduce and study a more challenging protocol with only one viewpoint at training. Recent methods assuming multi-view training data (Li et al. 2018a, b; Wang et al. 2018) also become inapplicable.

A naive way to achieve generalization is to collect data from all views, for all possible conditions, but this is impractical due to combinatorial explosion (Yuille et al. 2018). Instead, we augment the existing real data synthetically to increase the diversity in terms of viewpoints, appearance, and motions. Synthetic humans are relatively easy to render for tasks such as pose estimation, because arbitrary motion capture (MoCap) resources can be used (Shotton et al. 2011; Varol et al. 2017). However, action classification requires certain motion patterns and semantics. It is challenging to generate synthetic data with action labels (De Souza et al. 2017). Typical MoCap datasets (CMU Mocap Database), targeted for pose diversity, are not suitable for action recognition due to the lack of clean action annotations. Even if one collects a MoCap dataset, it is still limited to a pre-defined set of categories.

In this work, we propose a new, efficient and scalable approach for generating synthetic videos with action labels from the target set of categories. We employ a 3D human motion estimation method, such as HMMR (Kanazawa et al. 2019) and VIBE (Kocabas et al. 2020), that automatically extracts the 3D human dynamics from a single-view RGB video. The resulting sequence of SMPL body (Loper et al. 2015) pose parameters is then combined with other randomized generation components (e.g. viewpoint, clothing) to render diverse complementary training data with action annotations. Figure 1 presents an overview of our pipeline. We demonstrate the advantages of such data when training spatio-temporal CNN models for (1) action recognition from unseen viewpoints and (2) training with one-shot real data. We boost performance on unseen viewpoints from 53.6 to 69.0% on NTU, and from 49.4 to 66.4% on the UESTC dataset by augmenting limited real training data with our proposed SURREACT dataset. Furthermore, we present an in-depth analysis of the importance of action-relevant augmentations such as diversity of motions and viewpoints, as well as our non-uniform frame sampling strategy, which substantially improves the action recognition performance. Our code and data will be available at the project page (Footnote 1).

2 Related Work

Human action recognition is a well-established research field. For a broad review of the literature on action recognition, see the recent survey of Kong and Fu (2018). Here, we focus on relevant works on synthetic data, cross-view action recognition, and briefly on 3D human shape estimation.

Synthetic Humans. Simulating human motion dates back to the 1980s. Badler et al. (1993) provide an extensive overview of early approaches. More recently, synthetic images of people have been used to train visual models for 2D/3D body pose and shape estimation (Chen et al. 2016; Ghezelghieh et al. 2016; Liu et al. 2019a; Pishchulin et al. 2012; Shotton et al. 2011; Varol et al. 2018), part segmentation (Shotton et al. 2011; Varol et al. 2017), depth estimation (Varol et al. 2017), multi-person pose estimation (Hoffmann et al. 2019), pedestrian detection (Marin et al. 2010; Pishchulin et al. 2012), person re-identification (Qian et al. 2018), hand pose estimation (Hasson et al. 2019; Zimmermann and Brox 2017), and face recognition (Kortylewski et al. 2018; Masi et al. 2019). Synthetic datasets built for these tasks, such as the recent SURREAL dataset (Varol et al. 2017), however, do not provide action labels.

Among previous works that focus on synthetic human data, very few tackle action recognition (De Souza et al. 2017; Liu et al. 2019b; Rahmani and Mian 2016). Synthetic 2D human pose sequences (Lv and Nevatia 2007) and synthetic point trajectories (Rahmani and Mian 2015; Rahmani et al. 2018; Jingtian et al. 2018) have been used for view-invariant action recognition. However, RGB-based synthetic training for action recognition is relatively new, with De Souza et al. (2017) being one of the first attempts. De Souza et al. (2017) manually define 35 action classes and jointly estimate real categories and synthetic categories in a multi-task setting. However, their categories are not easily scalable and do not necessarily relate to the target set of classes. Unlike De Souza et al. (2017), we automatically extract motion sequences from real data, making the method flexible for new categories. Recently, Puig et al. (2018) have generated the VirtualHome dataset, a simulation environment with programmatically defined synthetic activities using crowd-sourcing. Unlike our work, the focus of Puig et al. (2018) is not generalization to real data.

Most relevant to ours, Liu et al. (2019b) generate synthetic training images to achieve better performance on unseen viewpoints. The work of Liu et al. (2019b) extends Rahmani and Mian (2016) by using RGB-D as input instead of depth only. Both works formulate a frame-based pose classification problem on their synthetic data, which they then use as features for action recognition. These features are not necessarily discriminative for the target action categories. In contrast to this direction, we explicitly assign an action label to synthetic videos and define the supervision directly on action classification.

Cross-View Action Recognition. Due to the difficulty of building multi-view action recognition datasets, the standard benchmarks have been recorded in controlled environments. RGB-D datasets such as IXMAS (Weinland et al. 2007), UWA3D II (Rahmani et al. 2016) and N-UCLA (Wang et al. 2014) were state of the art until the availability of the large-scale NTU RGB+D dataset (Shahroudy et al. 2016). Unlike previous datasets, the size of NTU allows training deep neural networks. Very recently, Ji et al. (2018) collected the first large-scale dataset, UESTC, that has a 360\(^\circ \) coverage around the performer, although still in a lab setting.

Since multi-view action datasets are typically captured with depth sensing devices, such as Kinect, they also provide an accurate estimate of the 3D skeleton. Skeleton-based cross-view action recognition therefore received a lot of attention in the past decade (Ke et al. 2017; Liu et al. 2016, 2017a, b; Zhang et al. 2017). Variants of LSTMs (Hochreiter and Schmidhuber 1997) have been widely used (Liu et al. 2016, 2017a; Shahroudy et al. 2016). Recently, spatio-temporal skeletons were represented as images (Ke et al. 2017) or higher dimensional objects (Liu et al. 2017b), to which standard CNN architectures were applied.

RGB-based cross-view action recognition is, in comparison, less studied. Transforming RGB features to be view-invariant is not as trivial as transforming 3D skeletons. Early work on transferring appearance features from the source view to the target view explored the use of maximum margin clustering to build a joint codebook for temporally synchronous videos (Farhadi and Tabrizi 2008). Following this approach, several other works focused on building global codebooks to extract view-invariant representations (Kong et al. 2017; Liu et al. 2019c; Rahmani et al. 2018; Zheng and Jiang 2013; Zheng et al. 2016). Recently, end-to-end approaches used human pose information as guidance for action recognition (Baradel et al. 2017; Liu and Yuan 2018; Luvizon et al. 2018; Zolfaghari et al. 2017). Li et al. (2018a) formulated an adversarial view-classifier to achieve view-invariance. Wang et al. (2018) proposed to fuse view-specific features from a multi-branch CNN. Such approaches cannot handle single-view training (Li et al. 2018a; Wang et al. 2018). Our method differs from these works by compensating for the lack of view diversity with synthetic videos. We augment the real data automatically at training time, and our model does not involve any extra cost at test time, unlike Wang et al. (2018). Moreover, we do not assume real multi-view videos at training.

3D Human Shape Estimation. Recovering the full human body mesh from a single image has been explored as a model-fitting problem (Bogo et al. 2016; Lassner et al. 2017), as regressing model parameters with CNNs (Kanazawa et al. 2018; Omran et al. 2018; Pavlakos et al. 2018; Tung et al. 2017), and as regressing non-parametric representations such as graphs or volumes (Kolotouros et al. 2019; Varol et al. 2018). Recently, CNN-based parameter regression approaches have been extended to video (Kanazawa et al. 2019; Liu et al. 2019a; Kocabas et al. 2020). HMMR (Kanazawa et al. 2019) builds on the single-image-based HMR (Kanazawa et al. 2018) to learn the human dynamics by using 1D temporal convolutions. More recently, VIBE (Kocabas et al. 2020) adopts a recurrent model based on frame-level pose estimates provided by SPIN (Kolotouros et al. 2019). VIBE also incorporates an adversarial loss that penalizes the estimated pose sequence if it is not a ‘realistic’ motion, i.e., indistinguishable from the real AMASS (Mahmood et al. 2019) MoCap sequences. In this work, we recover 3D body parameters from real videos using HMMR (Kanazawa et al. 2019) and VIBE (Kocabas et al. 2020). Both methods employ the SMPL body model (Loper et al. 2015). We provide a comparison between the two methods for our purpose of action recognition, which can serve as a proxy task to evaluate motion estimation.

3 Synthetic Humans with Action Labels

Our goal is to improve the performance of action recognition using synthetic data in cases where the real data is limited, e.g. when there is a domain mismatch between training and test data (such as different viewpoints) or in a low-data regime. In the following, we describe the three stages of: (1) obtaining 3D temporal models for human actions from real training sequences (at a particular viewpoint) (Sect. 3.1); (2) using these 3D temporal models to generate training sequences for new (and the original) viewpoints using a rendering pipeline with augmentation (Sect. 3.2); and (3) training a spatio-temporal CNN with both real and synthetic data (Sect. 3.3).

3.1 3D Human Motion Estimation

In order to generate a synthetic video with graphics techniques, we need to have a sequence of articulated 3D human body models. We employ the parametric body model SMPL (Loper et al. 2015), which is a statistical model learned over thousands of 3D scans. SMPL generates the mesh of a person given the disentangled pose and shape parameters. The pose parameters (\(\mathbb {R}^{72}\)) control the kinematic deformations due to skeletal posture, while the shape parameters (\(\mathbb {R}^{10}\)) control identity-specific deformations such as the person's height.

We hypothesize that a human action can be captured by the sequence of pose parameters, and that the shape parameters are largely irrelevant (note, this may not necessarily be true for human-object interaction categories). Given reliable 3D pose sequences from action recognition video datasets, we can transfer the associated action labels to synthetic videos. We use the recent method of Kanazawa et al. (2019), namely human mesh and motion recovery (HMMR), unless stated otherwise. HMMR extends the single-image reconstruction method HMR (Kanazawa et al. 2018) to video with a multi-frame CNN that takes into account a temporal neighborhood around a video frame. HMMR learns a temporal representation for human dynamics by incorporating large-scale 2D pseudo-ground truth poses for in-the-wild videos. It uses PoseFlow (Zhang et al. 2018) and AlphaPose (Fang et al. 2017) for multi-person 2D pose estimation and tracking as a pre-processing step. Each person crop is then given as input to the CNN for estimating the pose and shape, as well as the weak-perspective camera parameters. We refer the reader to (Kanazawa et al. 2019) for more details. We choose this method for its robustness on in-the-wild videos, its ability to capture multiple people, and the smoothness of the recovered motion, which are important for our generalization from synthetic videos to real. Figure 1 presents the 3D pose animated synthetically for sample video frames. We also experiment with the more recent motion estimation method, VIBE (Kocabas et al. 2020), and show that improvements in motion estimation proportionally affect the action recognition performance in our pipeline. Note that we only use the pose parameters from HMMR or VIBE, and randomly change the shape parameters, camera parameters, and other factors. Next, we present the augmentations in our synthetic data generation.

3.2 SURREACT Dataset Components

In this section, we give details on our synthetic dataset, SURREACT (Synthetic hUmans foR REal ACTions).

We follow (Varol et al. 2017) and render 3D SMPL sequences with randomized cloth textures, lighting, and body shapes. We animate the body model with our automatically extracted pose dynamics as described in the previous section. We explore various motion augmentation techniques to increase intra-class diversity in our training videos. We incorporate multi-person videos which are especially important for two-people interaction categories. We also systematically sample from 8 viewpoints around a circle to perform controlled experiments. Different augmentations are illustrated in Fig. 2 for a sample synthetic frame. Visualizations from SURREACT are further provided in Fig. 3.

Each generated video has automatic ground truth for 3D joint locations, part segmentation, optical flow, and SMPL body (Loper et al. 2015) parameters, as well as an action label, which we use for training a video-based 3D CNN for action classification. We use other ground truth modalities as input to action recognition as oracle experiments (see Table 14). We further use the optical flow ground truth to train a flow estimator and use the segmentation to randomly augment the background pixels in some experiments.

Our new SURREACT dataset differs from the SURREAL dataset (Varol et al. 2017) mainly by providing action labels, exploring motion augmentation, and by using automatically extracted motion sequences instead of MoCap recordings (CMU Mocap Database). Moreover, Varol et al. (2017) do not exploit the temporal aspect of their dataset, but only train CNNs with single-image input. We further employ multi-person videos and a systematic viewpoint distribution.

Motion Augmentation. Automatic extraction of 3D sequences from 2D videos poses an additional challenge in our dataset compared to clean high-quality MoCap sequences. To reduce the jitter, we temporally smooth the estimated SMPL pose parameters by weighted linear averaging. SMPL poses are represented as axis-angle rotations between joints. We convert them into quaternions when we apply linear operations, then normalize each quaternion to have a unit norm, before converting back to axis-angles. Even with this processing, the motions may remain noisy, which is inevitable given that monocular 3D motion estimation is a difficult task on its own. Our findings interestingly suggest that the synthetic human videos are still beneficial when the motions are noisy.
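The smoothing step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the three-tap window weights, the edge padding, and the sign-alignment step (needed because q and -q encode the same rotation) are all assumptions introduced here.

```python
import numpy as np

def axis_angle_to_quat(aa):
    """(..., 3) axis-angle -> (..., 4) unit quaternion in (w, x, y, z) order."""
    angle = np.linalg.norm(aa, axis=-1, keepdims=True)
    safe = np.where(angle < 1e-8, 1.0, angle)  # avoid division by zero
    half = 0.5 * angle
    return np.concatenate([np.cos(half), aa / safe * np.sin(half)], axis=-1)

def quat_to_axis_angle(q):
    """(..., 4) quaternion -> (..., 3) axis-angle with angle in [0, pi]."""
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    q = np.where(q[..., :1] < 0, -q, q)        # pick the w >= 0 representative
    xyz = q[..., 1:]
    s = np.linalg.norm(xyz, axis=-1, keepdims=True)
    angle = 2.0 * np.arctan2(s, q[..., :1])
    safe = np.where(s < 1e-8, 1.0, s)
    return xyz / safe * angle

def smooth_poses(poses, weights=(0.25, 0.5, 0.25)):
    """Weighted linear averaging over time of (T, J, 3) axis-angle poses.

    `weights` must have odd length; its values are a hypothetical choice.
    """
    q = axis_angle_to_quat(np.asarray(poses, dtype=float))
    # Assumption: make consecutive frames sign-consistent before averaging.
    for t in range(1, len(q)):
        flip = np.sum(q[t] * q[t - 1], axis=-1, keepdims=True) < 0
        q[t] = np.where(flip, -q[t], q[t])
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    r = len(w) // 2
    padded = np.pad(q, ((r, r), (0, 0), (0, 0)), mode="edge")
    out = np.zeros_like(q)
    for k in range(len(w)):
        out += w[k] * padded[k:k + len(q)]
    out /= np.linalg.norm(out, axis=-1, keepdims=True)  # back to unit norm
    return quat_to_axis_angle(out)
```

Averaging in quaternion space and renormalizing is only an approximation of rotation averaging, which is reasonable when neighboring frames are close to each other, as is typically the case for consecutive video frames.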

To increase motion diversity, we further perturb the pose parameters with various augmentations. Specifically, we use a video-level additive noise on the quaternions for each body joint to slightly change the poses, as an intra-individual augmentation. We also experiment with an inter-individual augmentation by interpolating between motion sequences of the same action class. Given a pair of sequences from two individuals, we first align them with dynamic time warping (Sakoe and Chiba 1978), then we linearly interpolate the quaternions of the time-aligned sequences to generate a new sequence, which we refer to as interpolation. A visual explanation of the process can be found in [...]. We show significant gains by increasing motion diversity.
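As an illustration of the inter-individual augmentation, the sketch below aligns two quaternion sequences with a textbook dynamic time warping and then blends the aligned frames. The function names, the Euclidean frame distance, the blending weight `alpha`, and the sign alignment before blending are assumptions made for this example; the paper does not specify these details here.

```python
import numpy as np

def dtw_path(a, b):
    """Classic DTW between feature sequences a (Ta, D) and b (Tb, D).

    Returns the list of aligned index pairs (i, j) from start to end.
    """
    ta, tb = len(a), len(b)
    cost = np.linalg.norm(a[:, None] - b[None], axis=-1)
    acc = np.full((ta + 1, tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    i, j = ta, tb
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    return path[::-1]

def interpolate_motions(qa, qb, alpha=0.5):
    """Blend two motions given as unit quaternions, qa (Ta, J, 4), qb (Tb, J, 4)."""
    path = dtw_path(qa.reshape(len(qa), -1), qb.reshape(len(qb), -1))
    ia, ib = zip(*path)
    a, b = qa[list(ia)], qb[list(ib)]
    # Assumption: flip b where needed so both lie on the same hemisphere.
    b = np.where(np.sum(a * b, axis=-1, keepdims=True) < 0, -b, b)
    out = (1.0 - alpha) * a + alpha * b
    return out / np.linalg.norm(out, axis=-1, keepdims=True)
```

The output has one frame per step of the warping path, so it is at least as long as the longer of the two inputs; resampling to a target length is left out of this sketch.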

Multi-person. We use the 2D pose information from (Fang et al. 2017; Zhang et al. 2018) to count the number of people in the real video. In the case of a single person, we center the person on the image and do not add 3D translation to the body, i.e., the person is centered independently for each frame. While such constant global positioning of the body loses information for some actions such as walking and jumping, we find that the translation estimate is too noisy to make this information useful and potentially increases the domain gap with the real data, where no such noise exists (see Appendix A). If there is more than one person, we insert additional body model(s) for rendering. We translate each person in the xy image plane. Note that we do not translate the person in full xyz space. We observe that the z component of the translation estimation is not reliable due to the depth ambiguity; therefore, the people are always centered at \(z=0\). More explanations about the reason for omitting the z component can be found in Appendix A. We temporally smooth the translations to reduce the noise. We subtract the mean of translations across the video and across the people to roughly center all people to the frame. We therefore keep the relative distances between people, which is important for actions such as walking towards each other.
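The multi-person centering can be sketched as below. The array layout (people, frames, xyz) and the function name are assumptions for illustration, and the temporal smoothing of the translations is omitted for brevity.

```python
import numpy as np

def center_translations(trans):
    """Center multi-person translations while keeping relative xy offsets.

    trans: (P, T, 3) array of estimated translations for P people over
    T frames (hypothetical layout).
    """
    out = np.asarray(trans, dtype=float).copy()
    out[..., 2] = 0.0  # discard the unreliable depth component (z = 0)
    # One global mean over people and frames, so relative distances survive.
    out[..., :2] -= out[..., :2].mean(axis=(0, 1), keepdims=True)
    return out
```

Because a single mean is subtracted from everyone, the xy offset between any two people in any frame is unchanged, which is the property the text relies on for interaction categories.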

Viewpoints. We systematically render each motion sequence 8 times by randomizing all other generation parameters at each view. In particular, we place the camera to be rotated at {\(0^{\circ }, 45^{\circ }, 90^{\circ }, 135^{\circ }, 180^{\circ }, 225^{\circ }, 270^{\circ },\) \(315^{\circ }\)} azimuth angles with respect to the origin, denoted as (\(0^\circ \):\(45^\circ \):\(360^\circ \)) in our experiments. The distance of the camera from the origin and the height of the camera from the ground are randomly sampled from a predefined range: [4, 6] meters for the distance, \([-1, 3]\) meters for the height. This can be adjusted according to the target test setting.
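A minimal sketch of this viewpoint sampling is given below. The azimuth set and the two ranges come from the text; the uniform distribution, the interpretation of the distance as horizontal distance from the origin, the axis convention, and the names used are assumptions.

```python
import math
import random

AZIMUTHS_DEG = list(range(0, 360, 45))  # 0, 45, ..., 315

def sample_cameras(rng, dist_range=(4.0, 6.0), height_range=(-1.0, 3.0)):
    """Return one camera per azimuth with random distance and height (meters)."""
    cameras = []
    for azimuth in AZIMUTHS_DEG:
        distance = rng.uniform(*dist_range)    # assumed uniform sampling
        height = rng.uniform(*height_range)
        a = math.radians(azimuth)
        # Assumed convention: z is up, azimuth 0 places the camera on +y.
        position = (distance * math.sin(a), distance * math.cos(a), height)
        cameras.append({"azimuth_deg": azimuth, "distance": distance,
                        "height": height, "position": position})
    return cameras

cameras = sample_cameras(random.Random(0))
```

Each motion sequence would then be rendered once per entry in `cameras`, with the remaining generation parameters re-randomized for every view.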

Backgrounds. Since we have access to the target real dataset where we run pose estimation methods, we can extract background pixels directly from the training set of this dataset. We crop from regions without the person to obtain static backgrounds for the NTU and UESTC datasets. We experimentally show the benefits of using the target dataset backgrounds in the Appendix (see Table 15). For Kinetics experiments, we render human bodies on top of unconstrained videos from non-overlapping action classes and show benefits over static backgrounds. Note that these background videos might also include human pixels.

3.3 Training 3D CNNs with Non-Uniform Frames

Following the success of 3D CNNs for video recognition (Carreira and Zisserman 2017; Hara et al. 2018; Tran et al. 2015), we employ a spatio-temporal convolutional architecture that operates on multi-frame video inputs. Unless otherwise specified, our network architecture is 3D ResNet-50 (Hara et al. 2018) and its weights are randomly initialized (see Appendix B.4 for pretraining experiments).

To study the generalization capability of synthetic data across different input modalities, we train one CNN for RGB and another for optical flow as in Simonyan and Zisserman (2014). We average the scores with equal weights when reporting the fusion.
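The equal-weight late fusion amounts to averaging the per-class scores of the two streams. A small sketch, with hypothetical score arrays of shape (videos, classes):

```python
import numpy as np

def fuse_scores(rgb_scores, flow_scores):
    """Average RGB and flow class scores with equal weights."""
    return 0.5 * (np.asarray(rgb_scores, dtype=float)
                  + np.asarray(flow_scores, dtype=float))

# Illustrative numbers only, not results from the paper.
rgb = [[0.6, 0.4], [0.2, 0.8]]
flow = [[0.1, 0.9], [0.3, 0.7]]
predictions = fuse_scores(rgb, flow).argmax(axis=1)
```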

We subsample fixed-sized inputs from videos to have a \(16\times 256\times 256\) spatio-temporal resolution, in terms of number of frames, width, and height, respectively. In case of optical flow input, we map the RGB input to \(15\times 64\times 64\) dimensional flow estimates. To estimate flow, we train a two-stack hourglass architecture (Newell et al. 2016) with our synthetic data for flow estimation on 2 consecutive frames. We refer the reader to Figure 10 for the qualitative results of our optical flow estimation.

Non-Uniform Frame Sampling. We adopt a different frame sampling strategy than most works (Carreira and Zisserman 2017; Feichtenhofer et al. 2019; Hara et al. 2018) in the context of 3D CNNs. Instead of uniformly sampling (at a fixed frame rate) a video clip with consecutive frames, we randomly sample frames across time while keeping their temporal order, which we refer to as non-uniform sampling. Although recent works explore multiple temporal resolutions, e.g. by regularly sampling at two different frame rates (Feichtenhofer et al. 2019), or randomly selecting a frame rate (Zhu and Newsam 2018), the sampled frames are equidistant from each other. TSN (Wang et al. 2016) and ECO (Zolfaghari et al. 2018) employ a hybrid strategy by regularly sampling temporal segments and randomly sampling a frame from each segment, which is a more restricted special case of our strategy. Moreover, TSN uses a 2D CNN without temporal modelling. ECO (Zolfaghari et al. 2018) also has 2D convolutional features on each frame, which are stacked as input to a 3D CNN only at the end of the network. None of these works provide controlled experiments to quantify the effect of their sampling strategy. The concurrent work of Chen et al. (2020) presents an experimental analysis comparing the dense consecutive sampling with the hybrid sampling of TSN.

Figure 4 compares the consecutive sampling with our non-uniform sampling. In our experiments, we report results for both and show improvements for the latter. Our videos are temporally trimmed around the action; therefore, each video is short, i.e. spans several seconds. During training we randomly sample 16 video frames as a fixed-sized input to the 3D CNN. Thus, the convolutional kernels become speed-invariant to some degree. This can be seen as a data augmentation technique, as well as a way to capture long-term cues.
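The two training-time sampling schemes can be contrasted with a short sketch. The function names are hypothetical, and the sketch assumes the video has at least as many frames as the clip length; how shorter videos are handled is not covered here.

```python
import random

def sample_consecutive(num_frames, clip_len, rng):
    """Consecutive sampling: a random window of adjacent frame indices."""
    if num_frames < clip_len:
        raise ValueError("video shorter than clip length")
    start = rng.randint(0, num_frames - clip_len)
    return list(range(start, start + clip_len))

def sample_nonuniform(num_frames, clip_len, rng):
    """Non-uniform sampling: random distinct frames, kept in temporal order."""
    if num_frames < clip_len:
        raise ValueError("video shorter than clip length")
    return sorted(rng.sample(range(num_frames), clip_len))

rng = random.Random(0)
clip_a = sample_consecutive(90, 16, rng)  # spans exactly 16 frames
clip_b = sample_nonuniform(90, 16, rng)   # may span most of the video
```

Since the gaps between sampled frames vary from clip to clip, the same action is seen at different apparent speeds, which is the source of the partial speed invariance mentioned above.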

At test time, we sample several 16-frame clips and average the softmax scores. If we test the uniform case, we sample non-overlapping consecutive clips with a sliding window. For the non-uniform case, we randomly sample as many non-uniform clips as the number of sliding windows for the uniform case. In other words, the number of sampled clips is proportional to the video length. More precisely, let T be the number of frames in the entire test video, F be the number of input frames per clip, and S be the stride parameter. We sample N clips where \(N = \left\lceil (T - F) / S\right\rceil + 1\). In our case \(F=16\), \(S=16\). We apply a sliding window for the uniform case. For the non-uniform case, we sample N clips, where each clip is an ordered random (without replacement) 16-frame subset from T. We observe that it is important to train and test with the same sampling scheme, and that keeping the temporal order is important. More details can be found in Appendix B.5.
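As a worked example of the clip count, a 100-frame video with F = 16 and S = 16 gives N = ⌈84/16⌉ + 1 = 6 + 1 = 7 clips. The sketch below computes N and the frame indices for both test-time schemes. The function names are hypothetical, and clamping the last sliding window so that it stays inside the video is an assumption: the text does not say how the final partial window is handled.

```python
import math
import random

def num_clips(T, F=16, S=16):
    """N = ceil((T - F) / S) + 1, for a video with T >= F frames."""
    return math.ceil((T - F) / S) + 1

def uniform_clips(T, F=16, S=16):
    """Sliding-window clips of consecutive frames."""
    clips = []
    for k in range(num_clips(T, F, S)):
        start = min(k * S, T - F)  # assumption: clamp the last window
        clips.append(list(range(start, start + F)))
    return clips

def nonuniform_clips(T, rng, F=16, S=16):
    """Same number of clips, each an ordered random F-frame subset."""
    return [sorted(rng.sample(range(T), F)) for _ in range(num_clips(T, F, S))]

clips = nonuniform_clips(100, random.Random(0))
```

The per-clip softmax scores of the network would then be averaged over these N clips to obtain the video-level prediction.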

Synth+Real. Since each real video is augmented multiple times (e.g. 8 times for 8 views), we have more synthetic data than real. When we add synthetic data to training, we balance the real and synthetic datasets such that at each epoch we randomly subsample the synthetic videos to have an equal number of real and synthetic videos.
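The per-epoch balancing can be sketched as follows. The function name and the final shuffle are assumptions for illustration; the list elements stand in for whatever video identifiers a data loader would use.

```python
import random

def balanced_epoch(real_videos, synth_videos, rng):
    """Build one epoch with as many synthetic videos as real ones."""
    k = min(len(real_videos), len(synth_videos))
    synth_subset = rng.sample(list(synth_videos), k)  # redrawn every epoch
    epoch = list(real_videos) + synth_subset
    rng.shuffle(epoch)  # assumption: real and synthetic samples are mixed
    return epoch

real = [f"real_{i}" for i in range(100)]
synth = [f"synth_{i}_view{v}" for i in range(100) for v in range(8)]
epoch = balanced_epoch(real, synth, random.Random(0))
```

Redrawing the subset at each epoch means that, over many epochs, the network still sees most of the synthetic videos even though each epoch uses only a fraction of them.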

We minimize the cross-entropy loss using RMSprop (Tieleman and Hinton 2012) with mini-batches of size 10 and an initial learning rate of \(10^{-3}\) with a fixed schedule. Color augmentation is used for the RGB stream. Other implementation details are given in Appendix A.

Table 1 Training jointly on synthetic and real data substantially boosts the performance compared to only real training on NTU CVS protocol, especially on unseen views (\(45^\circ \), \(90^\circ \)) (e.g., 69.0% vs 53.6%). The improvement can be seen for both RGB and Flow streams, as well as the fusion. We note the marginal improvements with the addition of flow unlike in other tasks where flow has been used to reduce the synthetic-real domain gap (Doersch and Zisserman 2019). We render two different versions of the synthetic dataset using HMMR and VIBE motion estimation methods, and observe improvements with VIBE. Moreover, training on synthetic videos alone is able to obtain 63.0% accuracy
