Learning Multi-human Optical Flow

The optical flow of humans is well known to be useful for the analysis of human action. Recent optical flow methods focus on training deep networks to approach the problem. However, the training data used by them does not cover the domain of human motion. Therefore, we develop a dataset of multi-human optical flow and train optical flow networks on this dataset. We use a 3D model of the human body and motion capture data to synthesize realistic flow fields in both single- and multi-person images. We then train optical flow networks to estimate human flow fields from pairs of images. We demonstrate that our trained networks are more accurate than a wide range of top methods on held-out test data and that they can generalize well to real image sequences. The code, trained models and the dataset are available for research.


Introduction
A significant fraction of videos on the Internet contain people moving [4], and the literature suggests that optical flow plays an important role in understanding human action [5,6]. Several action recognition datasets [6,7] contain human motion as a major component. The 2D motion of humans in video, or human optical flow, is an important feature that provides a building block for systems that can understand and interact with humans. Human optical flow is useful for various applications, including analyzing pedestrians in road sequences, motion-controlled gaming, activity recognition and human pose estimation.
Despite this, optical flow has previously been treated as a generic, low-level vision problem. Given the importance of people, and the value of optical flow in understanding them, we develop a dataset and trained models that are specifically tailored to humans and their motion. Such motions are nontrivial since humans are complex, articulated objects that vary in shape, size and appearance. They move quickly, adopt a wide range of poses, and self-occlude or occlude each other in multi-person scenarios.
Our goal is to obtain more accurate 2D motion estimates for human bodies by training a flow algorithm specifically for human movement. To do so, we create a large and realistic dataset of humans moving in virtual worlds with ground truth optical flow (Fig. 1(a)), called the Human Optical Flow dataset. It comprises two parts: the Single-Human Optical Flow dataset (SHOF), where the image sequences contain only one person in motion, and the Multi-Human Optical Flow dataset (MHOF), where images contain multiple people with significant occlusion between them. We analyse the performance of SPyNet [2] and PWC-Net [3] by training (fine-tuning) them on both the SHOF and MHOF datasets. We observe that the optical flow performance of the networks improves on sequences containing human scenes, both qualitatively and quantitatively. Furthermore, we show that the trained networks generalize to real video sequences (Fig. 1(c)).
Fig. 1 (a) We simulate human motion in virtual worlds, creating an extensive dataset with images (top row) and flow fields (bottom row); color coding from [1]. (b) We train SPyNet [2] and PWC-Net [3] for human motion estimation and show that they perform better when trained on our dataset and (c) can generalize to human motions in real world scenes. Columns show single-person and multi-person cases alternately.
Several datasets and benchmarks [1,8,9] have been established to drive the progress in optical flow. We argue that these datasets are insufficient for the task of human motion estimation and that, despite its importance, no attention has been paid to datasets and models for human optical flow. One of the main reasons is that dense human motion is extremely difficult to capture accurately in real scenes. Without ground truth, there has been little work focused specifically on estimating human optical flow. To advance research on this problem, the community needs a dataset tailored to human optical flow.
A key observation from recent work is that optical flow methods trained on synthetic data [2,10,11] generalize relatively well to real data. Additionally, these methods obtain state-of-the-art results with increased realism of the training data [12,13]. This motivates our effort to create a dataset designed for human motion.
To that end, we use the SMPL [14] and SMPL+H [15] models, which capture the human body alone and the body together with articulated hands, respectively, to generate different human shapes including hand and finger motion. We then place humans on random indoor backgrounds and simulate human activities such as running, walking and dancing using motion capture data [16,17]. Thus, we create a large virtual dataset that captures the statistics of natural human motion in multi-person scenarios. We then train deep neural networks on this dataset and evaluate their performance for estimating human motion. While the dataset can be used to train any flow method, we focus specifically on networks based on spatial pyramids, namely SPyNet [2] and PWC-Net [3], because they are compact and computationally efficient.
A preliminary version of this work appeared in [18], which presented a dataset and model for human optical flow for the single-person case with a body-only model. The present work extends [18] to the multi-person case, as images with multiple occluding people have different statistics. It further employs a holistic model of the body together with hands for more realistic motion variation. This work also extends training to both SPyNet [2] and PWC-Net [3] using the new dataset, in contrast to training only SPyNet in the earlier work [18]. Our experiments show both qualitative and quantitative improvements.
In summary, our major contributions in this extended work are:
1) We provide the Single-Human Optical Flow dataset (SHOF) of human bodies in motion with realistic textures and backgrounds, having 146,020 frame pairs for single-person scenarios.
2) We provide the Multi-Human Optical Flow dataset (MHOF), with 111,312 frame pairs of multiple human bodies in motion, with improved textures and realistic visual occlusions, but without (self-)collisions or intersection of body meshes. These two datasets together comprise the Human Optical Flow dataset.
3) We fine-tune SPyNet [18] on SHOF and show that its performance improves by about 43% (over the initial SPyNet), while it also outperforms existing state of the art by about 30%. Furthermore, we fine-tune SPyNet and PWC-Net on MHOF and observe improvements of 10-20% (over the initial SPyNet and PWC-Net). Compared to existing state of the art, improvements are particularly high for human regions. After masking out the background, we observe improvements of up to 13% for human pixels.
4) We provide the dataset files, dataset rendering code, training code and trained models for research purposes.

Related Work
Human Motion. Human motion can be understood from 2D motion. Early work focused on the movement of 2D joint locations [19] or simple motion history images [20]. Optical flow is also a useful cue. Black et al. [21] use principal component analysis (PCA) to parametrize human motion but use noisy flow computed from image sequences for training data. More similar to our approach, Fablet and Black [22] use a 3D articulated body model and motion capture data to project 3D body motion into 2D optical flow. They then learn a view-based PCA model of the flow fields. We use a more realistic body model to generate a large dataset and use this to train a CNN to directly estimate dense human flow from images.
Only a few works in pose estimation have exploited human motion; in particular, several methods [23,24] use optical flow constraints to improve 2D human pose estimation in videos. Similar work [25,26] propagates pose results temporally using optical flow to encourage time consistency of the estimated bodies. Apart from its application in warping between frames, the structural information existing in optical flow has been used for pose estimation either alone [27] or in conjunction with an image stream [28,29].
Learning Optical Flow. There is a long history of optical flow estimation, which we do not review here. Instead, we focus on the relatively recent literature on learning flow. Early work looked at learning flow using Markov Random Fields [30], PCA [31], or shallow convolutional models [32]. Other methods also combine learning with traditional approaches, formulating flow as a discrete [33] or continuous [34] optimization problem.
The most recent methods employ large datasets to estimate optical flow using deep neural networks. Voxel2Voxel [35] is based on volumetric convolutions to predict optical flow using 16 frames simultaneously but does not perform well on benchmarks. Other methods [2,10,11] compute two-frame optical flow using an end-to-end deep learning approach. FlowNet [10] uses the Flying Chairs dataset [10] to compute optical flow in an end-to-end deep network. FlowNet 2.0 [11] uses stacks of networks from FlowNet and performs significantly better, particularly for small motions. Ranjan and Black [2] propose a Spatial Pyramid Network that employs a small neural network on each level of an image pyramid to compute optical flow. Their method uses a much smaller number of parameters and achieves performance similar to FlowNet [10] using the same training data. Sun et al. [3] use image features in a similar spatial pyramid network, achieving state-of-the-art results on optical flow benchmarks. Since the above methods are not trained with human motions, they do not perform well on our Human Optical Flow dataset.
Optical Flow Datasets. Several datasets have been developed to facilitate training and benchmarking of optical flow methods. Middlebury is limited to small motions [1], KITTI is focused on rigid scenes and automotive motions [8], while Sintel has a limited number of synthetic scenes [9]. These datasets are mainly used for evaluation of optical flow methods and are generally too small to support training neural networks.
To learn optical flow using neural networks, more datasets have emerged that contain examples on the order of tens of thousands of frames. The Flying Chairs [10] dataset contains about 22,000 samples of chairs moving against random backgrounds. Although it is not very realistic or diverse, it provides training data for neural networks [2,10] that achieve reasonable results on optical flow benchmarks. More recent datasets [12,13] for optical flow are designed specifically for training deep neural networks. Flying Things [12] contains tens of thousands of samples of random 3D objects in motion. The Monkaa and Driving scene datasets [12] contain frames from animated scenes and virtual driving, respectively. Virtual KITTI [13] uses graphics to generate scenes like those in KITTI and is two orders of magnitude larger. Recent synthetic datasets [36] show that synthetic data can train networks that generalize to real scenes.
For human bodies, the SURREAL dataset [37] uses 3D human meshes rendered on top of color images to train networks for depth estimation and body part segmentation. While the images are not fully realistic, the authors show that this data is sufficient to train methods that generalize to real data. In a similar fashion, [38,39] and [40] render synthetic color images for 3D hand pose estimation and 3D hand-object reconstruction, respectively. We go beyond these works to address the problem of optical flow.

The Human Optical Flow Dataset
Our approach generates a realistic dataset of synthetic human motions by simulating them against different realistic backgrounds. We use parametric models [15,41] to generate synthetic humans with a wide variety of different human shapes. We employ Blender and its Cycles rendering engine to generate realistic synthetic image frames and optical flow. In this way we create the Human Optical Flow dataset, which comprises two parts. We first create the Single-Human Optical Flow (SHOF) dataset [18] using the body-only SMPL model [41] in images containing a single synthetic human. However, image statistics are different for the single- and multi-person case, as multiple people tend to occlude each other in complicated ways. For this reason we then create the Multi-Human Optical Flow (MHOF) dataset to better capture this realistic interaction. To make images even more realistic for MHOF, we replace SMPL [41] with the SMPL+H [15] model, which models the body together with articulated fingers, to have richer motion variation.
Fig. 2 Pipeline for generating the RGB frames and ground truth optical flow for the Multi-Human Optical Flow dataset. The datasets used in this pipeline are listed in Table 1, while the various rendering components are summarized in Table 2.
In the rest of this section, we describe the components of our rendering pipeline, shown in Figure 2. For easy reference, Table 1 summarizes the data used to generate the SHOF and MHOF datasets, while Table 2 summarizes the various tools, Blender passes and parameters used for rendering.

Human Body Generation
Body Model. A parametrized body model is necessary to generate human bodies in a scene. In the SHOF dataset, we use SMPL [41] for generating human body shapes. For the MHOF dataset, we use SMPL+H [15], which parametrizes the human body together with articulated fingers, for increased realism. The models are parameterized by pose and shape parameters to change the body posture and identity, as shown in Figure 2. They also contain a UV appearance map that allows us to change the skin tone, face features and clothing texture of the resulting virtual humans.
Body Poses. The next step is articulating the human body with different poses, to create moving sequences. To find such poses, we use 3D MoCap datasets [42,43,44] that capture the 3D positions of MoCap markers glued onto the skin surface of real human subjects. We then employ MoSh [16,17], which fits our body model to these 3D markers by optimizing over parameters of the body model for articulated pose, translation and shape. The pose specifically is a vector of axis-angle parameters that describes how to rotate each body part around its corresponding skeleton joint.
For the MHOF dataset, we use the CMU [43] and HumanEva [44] MoCap datasets to increase motion variation. From the CMU MoCap dataset, we use 2,605 sequences of 23 high-level action categories. From the HumanEva dataset, we use more than 10 sequences performing actions from 6 different action categories. To reduce redundant poses and allow for larger motions between frames, sequences are subsampled to 12 fps, resulting in 321,873 poses. As a result, the final MHOF dataset has 254,211 poses for training, 32,670 for validation and 34,992 for testing.
Hand Poses. Traditionally, MoCap systems and datasets [42,43,44] record the motion of body joints, and avoid the tedious capture of detailed hand and finger motion. However, in natural settings, people use their body, hands and fingers to communicate social cues and to interact with the physical world. To enable our methods to learn such subtle motions, they should be represented in our training data. Therefore, we use the SMPL+H model [15] and augment the body-only MoCap datasets, described above, with finger motion. Instead of using random finger poses that would generate unrealistic optical flow, we employ the Embodied Hands dataset [15] and sample continuous finger motion to generate realistic optical flow. We use 43 sequences of hand motion with 37,232 frames recorded at 60 Hz by [15]. Similarly to body MoCap, we subsample hand MoCap to 12 fps to reduce overlapping poses without sacrificing variability.
Body Shapes. Human bodies vary a lot in their proportions, since each person has a unique body shape. To represent this in our dataset, we first learn a gender-specific Gaussian distribution of shape parameters by fitting SMPL to 3D CAESAR scans [45] of both genders. We then sample random body shapes from this distribution to generate a large number of realistic body shapes for rendering. However, naive sampling can result in extreme and unrealistic shape parameters; therefore, we bound the shape distribution to avoid unlikely shapes.
For the SHOF dataset we bound the shape parameters to the range of [−3, 3] standard deviations for each shape coefficient and draw a new shape for every subsequence of 20 frames to increase variance.
For the MHOF dataset, we account explicitly for collisions and intersections, since intersecting virtual humans would result in the generation of inaccurate optical flow. To minimize such cases, we use similar sampling as above with only small differences. We first use shorter subsequences of 10 frames for less frequent inter-human intersections. Furthermore, we bound the shape distribution to the narrower range of [−2.7, 2.7] standard deviations, since re-targeting motion to unlikely body shapes is more prone to mesh self-intersections.
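The bounded shape sampling described above can be sketched as follows. This is a minimal illustration, not the dataset's rendering code: it assumes independent (diagonal) Gaussians per shape coefficient and enforces the bound by rejection sampling, whereas the text only states that the distribution is bounded. The 10-dimensional zero-mean, unit-variance shape space and the names `sample_shape` and `NUM_COEFFS` are hypothetical.

```python
import random


def sample_shape(mean, std, bound, rng):
    """Draw one shape vector.

    Each coefficient is redrawn until it lies within `bound` standard
    deviations of its mean (rejection sampling; an assumption, since the
    text only says the distribution is bounded).
    """
    shape = []
    for mu, sigma in zip(mean, std):
        while True:
            value = rng.gauss(mu, sigma)
            if abs(value - mu) <= bound * sigma:
                break
        shape.append(value)
    return shape


# Hypothetical shape space: 10 coefficients, zero mean, unit variance.
NUM_COEFFS = 10
mean = [0.0] * NUM_COEFFS
std = [1.0] * NUM_COEFFS

rng = random.Random(0)
shof_shape = sample_shape(mean, std, bound=3.0, rng=rng)  # SHOF: [-3, 3]
mhof_shape = sample_shape(mean, std, bound=2.7, rng=rng)  # MHOF: [-2.7, 2.7]
```

In a pipeline of this kind, a new shape would be drawn at the start of every subsequence (20 frames for SHOF, 10 frames for MHOF).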
Body Texture. We use the CAESAR dataset [45] to generate a variety of human skin textures. Given SMPL registrations to CAESAR scans, the original per-vertex color in the CAESAR dataset is transferred into the SMPL texture map. Since fiducial markers were placed on the bodies of CAESAR subjects, we remove them from the textures and inpaint them to produce a natural texture. In total, we use 166 CAESAR textures that are of good quality. The main drawback of CAESAR scans is their homogeneity in terms of outfit, since all of the subjects wore grey shorts and the women wore sports bras. In order to increase the clothing variety, we also use textures extracted from our 3D scans (referred to as non-CAESAR in the following), to which we register SMPL with 4Cap [51]. A total of 772 textures from 7 different subjects with different clothes were captured. We anonymized the textures by replacing the face with the average face in CAESAR, after correcting it to match the skin tone of the texture. Textures are grouped according to gender, which is randomly selected for each virtual human.
For the SHOF dataset, the textures were split into training and testing sets with a 70/30 ratio, while each texture dataset is sampled with a 50% chance. For the MHOF dataset, we introduce a more refined split with an 80/10/10 ratio for the train, validation and test sets. Moreover, since we also introduce finger motion, we want to favour sampling non-CAESAR textures, due to the bad quality of CAESAR texture maps for the finger region. Thus each individual texture is sampled with equal probability, which favours the larger non-CAESAR set (772 versus 166 textures).
Hand Texture. Hands and fingers are hard to scan due to occlusions and measurement limitations. As a result, texture maps are particularly noisy or might even have holes. Since texture is important for optical flow, we augment the body texture maps to improve hand regions. For this we follow a divide-and-conquer approach. First, we capture hand-only scans with a 3dMD scanner [15]. Then, we create hand-only textures using the MANO model [15], getting 176 high-resolution textures from 20 subjects. Finally, we use the hand-only textures to replace the problematic hand regions in the full-body texture maps.
We also need to find the best matching hand-only texture for every body texture. Therefore, we convert all texture maps to HSV space, and compute the mean HSV value for each texture map from standard sampling regions. For full-body textures, we sample face regions without facial hair, while for hand-only textures, we sample the center of the outer palm. Then, for each body texture map we find the closest hand-only texture map in HSV space, and shift the values of the latter by the HSV difference, so that the hand skin tone becomes more similar to the facial skin tone. Finally, this improved hand-only texture map is used to replace the pixels in the hand region of the full-body texture map.
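The matching step can be illustrated with a short sketch using Python's standard `colorsys` module. It is a simplification under stated assumptions: pixels are (r, g, b) tuples in [0, 1], the distance is squared Euclidean in HSV, and hue is averaged linearly, ignoring its circular wrap-around. The function names and the sample pixel values are hypothetical.

```python
import colorsys


def mean_hsv(rgb_pixels):
    """Mean HSV of a list of (r, g, b) pixels with channels in [0, 1]."""
    hsv = [colorsys.rgb_to_hsv(*p) for p in rgb_pixels]
    n = len(hsv)
    return tuple(sum(c[i] for c in hsv) / n for i in range(3))


def closest_hand_texture(body_hsv, hand_hsvs):
    """Index of the hand texture whose mean HSV is nearest to the body's."""
    def dist(h):
        return sum((a - b) ** 2 for a, b in zip(body_hsv, h))
    return min(range(len(hand_hsvs)), key=lambda i: dist(hand_hsvs[i]))


def shift_hsv(pixel_hsv, source_mean, target_mean):
    """Shift one HSV pixel by (target_mean - source_mean), clamped to [0, 1]."""
    return tuple(min(1.0, max(0.0, p + (t - s)))
                 for p, s, t in zip(pixel_hsv, source_mean, target_mean))


# Hypothetical samples: face pixels of a body texture, palm pixels of two hands.
face_mean = mean_hsv([(0.80, 0.60, 0.50), (0.82, 0.62, 0.52)])
hand_means = [mean_hsv([(0.40, 0.30, 0.25)]), mean_hsv([(0.78, 0.60, 0.50)])]

best = closest_hand_texture(face_mean, hand_means)
# Every pixel of the chosen hand texture would be shifted like this one.
shifted = shift_hsv(hand_means[best], hand_means[best], face_mean)
```

After the shift, the mean palm colour of the selected hand texture coincides with the mean face colour, which is the effect the paragraph above describes.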
(Self-) Collision. The MHOF dataset contains multiple virtual humans moving differently, so there are high chances of collisions and penetrations. This is undesirable because penetrations are physically implausible and unrealistic. Moreover, the generated ground truth optical flow might have artifacts. Therefore, we employ a collision detection method to avoid intersections and penetrations.
Instead of using simple bounding boxes for rough collision detection, we draw inspiration from [52] and perform accurate and efficient collision detection on the triangle level using bounding volume hierarchies (BVH) [50]. This level of detailed detection allows for challenging occlusions with small distances between virtual humans, which can commonly be observed for realistic interactions between real humans. This method is useful not only for inter-person collision detection, but also for self-intersections. This is especially useful for our scenarios, as re-targeting body and hand motion to people of different shapes might result in unrealistic self-penetrations. The method is applicable out of the box, with the only exception that we exclude checks of neighboring body parts that are always or frequently in contact, e.g. upper and lower arm, or the two thighs.

Scene Generation
Background texture. For the scene background in the SHOF dataset, we use random indoor images from the LSUN dataset [47]. This provides a good compromise between simplicity and the complex task of generating varied full 3D environments. We use 417,597 images from the LSUN categories kitchen, living room, bedroom and dining room. These images are placed as billboards, 9 meters from the camera, and are not affected by the spherical harmonics lighting.
In the MHOF dataset, we increase the variability in background appearance. We employ the Sun397 dataset [48], which contains images for 397 highly variable scenes that are both indoor and outdoor, in contrast to LSUN. For quality reasons, we reject all images with resolution smaller than 512 × 512 px, and also reject images that contain humans using Mask R-CNN [53,54]. As a result, we use 30,222 images, split into 24,178 for the training set and 3,022 for each of the validation and test sets. Further, we increase the distance between the camera and background to 12 meters, to increase the space in which the multiple virtual humans can move without colliding frequently with each other, while still being close enough for visual occlusions.
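As a small worked example of the filtering and splitting just described, the sketch below reproduces the reported split sizes. The rule used here (floor of 10% each for validation and test, remainder for training) is an assumption that happens to match the reported counts; the interpretation of the resolution criterion as "either side below 512 px" and the `contains_human` flag, which would come from a person detector such as Mask R-CNN, are likewise hypothetical.

```python
def keep_background(width, height, contains_human, min_side=512):
    """Accept an image only if both sides reach `min_side` and no human
    was detected in it (the detector itself is outside this sketch)."""
    return width >= min_side and height >= min_side and not contains_human


def split_counts(total, val_frac=0.1, test_frac=0.1):
    """Return (train, val, test) sizes; val and test are floored, and
    training receives the remainder (assumed rule)."""
    n_val = int(total * val_frac)
    n_test = int(total * test_frac)
    return total - n_val - n_test, n_val, n_test


train, val, test = split_counts(30222)
print(train, val, test)  # 24178 3022 3022
```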
Scene Illumination. We illuminate the bodies with Spherical Harmonics lighting [49], which defines basis vectors for light directions. This parameterization is useful for randomizing the scene light by randomly sampling the coefficients with a bias towards natural illumination. The coefficients are uniformly sampled between −0.7 and 0.7, apart from the ambient illumination, which has a minimum value of 0.3 to avoid extremely dark images, and the illumination direction, which is strictly negative to favour illumination coming from above.
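A hedged sketch of this lighting randomization is given below. Several details are assumptions and are not stated in the text: the use of 9 coefficients (as for second-order spherical harmonics), index 0 holding the ambient term, index 1 holding the vertical direction term, and an upper bound of 0.7 for the ambient term. The function name is hypothetical.

```python
import random


def sample_sh_coefficients(rng, num_coeffs=9, bound=0.7, min_ambient=0.3):
    """Sample spherical harmonics lighting coefficients.

    Assumed layout: coeffs[0] is ambient, coeffs[1] is the vertical
    direction term; the remaining ones are uniform in [-bound, bound].
    """
    coeffs = [rng.uniform(-bound, bound) for _ in range(num_coeffs)]
    # Ambient term: never below min_ambient, to avoid extremely dark images.
    coeffs[0] = rng.uniform(min_ambient, bound)
    # Direction term: 1 - random() lies in (0, 1], so the result is
    # strictly negative, favouring light that comes from above.
    coeffs[1] = -bound * (1.0 - rng.random())
    return coeffs


rng = random.Random(0)
light = sample_sh_coefficients(rng)
```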
Increasing Image Realism. In order to increase realism, we introduce three types of image imperfections. First, for 30% of the generated images we introduce camera motion between frames. This motion perturbs the location of the camera with Gaussian noise of 1 cm standard deviation between frames and rotation noise of 0.2 degrees standard deviation per dimension in an Euler angle representation. Second, we add motion blur to the scene using the Vector Blur Node in Blender, integrated over 2 frames sampled with 64 steps between the beginning and end point of the motion. Finally, we add a Gaussian blur to 30% of the images with a standard deviation of 1 pixel.
Scene Compositing. For animating virtual humans, each MoCap sequence is selected at least once. To increase variability, each sequence is split into subsequences. For the first frame of each subsequence, we sample a body and background texture, lights, blurring and camera motion parameters, and re-position virtual humans on the horizontal plane. We then introduce a random rotation around the z-axis for variability in the motion direction.
For the SHOF dataset, we use subsequences of 20 frames, and at the beginning of each one the single virtual human is re-positioned in the scene such that the pelvis is projected onto the image center.
For the MHOF dataset, we increase the variability with smaller subsequences of 10 frames and introduce more challenging visual occlusions by uniformly sampling the number of virtual humans in the range [4, 8]. We sample MoCap sequences $S_j$ with a probability proportional to their length, $p_j = |S_j| / \sum_{i=1}^{|S|} |S_i|$, where $|S_j|$ denotes the number of frames of sequence $S_j$ and $|S|$ the number of sequences. In contrast to the SHOF dataset, for the MHOF dataset the virtual humans are not re-positioned at the center, as they would all collide. Instead, they are placed at random locations on the horizontal plane within camera visibility, making sure there are no collisions with other virtual humans or the background plane during the whole subsequence.
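The per-scene sampling for MHOF can be sketched as follows, assuming the sequence probability is proportional to the number of frames in the sequence. The sequence lengths and function names are hypothetical, and collision-free placement is outside the scope of the sketch.

```python
import random


def sequence_probabilities(lengths):
    """Probability of each sequence, proportional to its frame count."""
    total = sum(lengths)
    return [n / total for n in lengths]


def sample_scene(lengths, rng, min_people=4, max_people=8):
    """Pick the number of virtual humans uniformly in [min_people,
    max_people] and one MoCap sequence index per human (with replacement)."""
    num_people = rng.randint(min_people, max_people)  # both ends inclusive
    probs = sequence_probabilities(lengths)
    return rng.choices(range(len(lengths)), weights=probs, k=num_people)


# Hypothetical frame counts of four MoCap sequences.
lengths = [120, 480, 60, 240]
probs = sequence_probabilities(lengths)

rng = random.Random(0)
scene = sample_scene(lengths, rng)
```

Longer sequences contribute more poses and are therefore drawn more often; here the second sequence has probability 480/900.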

Ground Truth Generation
Segmentation Masks. Using the material pass of Blender, we store for each frame the ground truth body part segmentation for our models. Although the body part segmentation for both models is similar, SMPL models the palm and fingers as one part, while SMPL+H has a different part segment for each finger bone. Figure 3 shows an example body part segmentation for SMPL+H. These segmentation masks allow us to perform a per-body-part evaluation of our optical flow estimation.
Rendering & Ground Truth Optical Flow. For generating images, we use the open source suite Blender and its vector pass. This render pass is typically used for producing motion blur, and it produces the motion in image space of every pixel, i.e. the ground truth optical flow. We are mainly interested in the result of this pass, together with the color rendering of the textured bodies.

Learning
We train two different network architectures to estimate optical flow on both the SHOF and MHOF datasets. We choose compact models that are based on spatial pyramids, namely SPyNet [2] and PWC-Net [3], shown in Figure 4. We denote the models trained on the SHOF dataset by SPyNet+SHOF; analogously, the models trained on the MHOF dataset are denoted SPyNet+MHOF and PWC+MHOF.
The spatial pyramid structure employs a convnet at each level of an image pyramid. A pyramid level works on a particular resolution of the image. The top level works on the full resolution and the image features are downsampled as we move to the bottom of the pyramid. Each level learns a convolutional layer $d$ to perform downsampling of image features. Similarly, a convolutional layer $u$ is learned for decoding optical flow. At each level, the convnet $G_k$ predicts optical flow residuals $v_k$ at that level. These flow residuals get added at each level to produce the full flow $V_K$ at the finest level of the pyramid.
In SPyNet, each convnet $G_k$ takes a pair of images as inputs along with the flow $V_{k-1}$ obtained by upsampling the output of the previous level. The second frame is, however, warped using $V_{k-1}$, and the triplet $\{I_k^1, w(I_k^2, V_{k-1}), V_{k-1}\}$ is fed as input to the convnet $G_k$. In PWC-Net, a pair of image features $\{I_k^1, I_k^2\}$ is input at a pyramid level, and the second feature map is warped using the flow $V_{k-1}$ from the previous level of the pyramid. We then compute the cost volume $c(I_k^1, w(I_k^2, V_{k-1}))$ over feature maps and pass it to network $G_k$ to compute optical flow $V_k$ at that pyramid level. We use the pretrained weights as initializations for training both SPyNet and PWC-Net. We train both models end-to-end to minimize the average End Point Error (EPE).
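To make the coarse-to-fine accumulation and the training objective concrete, the following sketch accumulates per-level flow residuals and computes the average EPE. It is an illustration under stated assumptions, not the networks' implementation: the residuals are fixed numbers rather than outputs of a convnet $G_k$, a fixed nearest-neighbour 2x upsampling stands in for the learned layer $u$, and flow fields are plain nested lists of (u, v) tuples. All function names are hypothetical.

```python
import math


def upsample_flow(flow):
    """Nearest-neighbour 2x upsampling of a flow field. Vectors are doubled
    because pixel displacements scale with the image resolution."""
    out = []
    for row in flow:
        wide = [(2.0 * u, 2.0 * v) for (u, v) in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_flow(a, b):
    """Element-wise sum of two flow fields of equal size."""
    return [[(ua + ub, va + vb) for (ua, va), (ub, vb) in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def coarse_to_fine(residuals):
    """Accumulate V_k = up(V_{k-1}) + v_k, residuals ordered coarse to fine."""
    flow = residuals[0]
    for residual in residuals[1:]:
        flow = add_flow(upsample_flow(flow), residual)
    return flow


def average_epe(pred, target):
    """Mean Euclidean distance between predicted and true flow vectors."""
    errors = [math.hypot(up - ut, vp - vt)
              for rp, rt in zip(pred, target)
              for (up, vp), (ut, vt) in zip(rp, rt)]
    return sum(errors) / len(errors)


# Two pyramid levels: a 1x1 coarse residual and a 2x2 fine residual.
coarse = [[(1.0, 0.0)]]
fine = [[(0.5, -0.5), (0.5, -0.5)], [(0.5, -0.5), (0.5, -0.5)]]
full_flow = coarse_to_fine([coarse, fine])

target = [[(2.5, 0.5), (2.5, 0.5)], [(2.5, 0.5), (2.5, 0.5)]]
epe = average_epe(full_flow, target)
```

In this toy case every pixel ends with flow (2.5, −0.5), so the EPE against the target (2.5, 0.5) is exactly one pixel.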
Hyperparameters. We follow the same training procedure for SPyNet and PWC-Net. The only exception to this is the learning rate, which is determined empirically for each dataset and network from $\{10^{-6}, 10^{-5}, 10^{-4}\}$. For SHOF, we found $10^{-6}$ to yield the best results for SPyNet. Predictions of PWC-Net on the SHOF dataset do not improve for any of these learning rates. For training on MHOF, learning rates of $10^{-6}$ and $10^{-4}$ yield the best results for SPyNet and PWC-Net, respectively. We use Adam [55] to optimize our loss with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. We use a batch size of 8 and run 400,000 training iterations. All networks are implemented in the PyTorch framework. Fine-tuning the networks from pretrained weights takes approximately 1 day on SHOF and 2 days on MHOF.
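For quick reference, the reported settings can be collected in a small configuration table. The sketch below only restates the values given above; the dictionary layout and the function `training_config` are hypothetical and are not taken from the released training code. No entry exists for PWC-Net on SHOF because none of the tested learning rates improved that model.

```python
# Best learning rate per (network, dataset), as reported in the text.
LEARNING_RATES = {
    ("SPyNet", "SHOF"): 1e-6,
    ("SPyNet", "MHOF"): 1e-6,
    ("PWC-Net", "MHOF"): 1e-4,
}
ADAM_BETAS = (0.9, 0.999)
BATCH_SIZE = 8
NUM_ITERATIONS = 400_000


def training_config(network, dataset):
    """Return the fine-tuning settings for one network/dataset pair."""
    key = (network, dataset)
    if key not in LEARNING_RATES:
        raise KeyError(f"no learning rate reported for {network} on {dataset}")
    return {
        "lr": LEARNING_RATES[key],
        "betas": ADAM_BETAS,
        "batch_size": BATCH_SIZE,
        "iterations": NUM_ITERATIONS,
    }


config = training_config("PWC-Net", "MHOF")
# In PyTorch, lr and betas could be passed on to
# torch.optim.Adam(model.parameters(), lr=config["lr"], betas=config["betas"]).
```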
Data Augmentations. We also augment our data by applying several transformations and adding noise. Although our dataset is quite large, augmentation improves the quality of results on real scenes. In particular, we apply scaling in the range of [0.3, 3], and rotations in [−17°, 17°]. The dataset is normalized to have zero mean and unit standard deviation using [56].
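When both frames of a pair are scaled and rotated by the same similarity transform, the ground truth flow vectors have to be transformed accordingly: a displacement d becomes s·R(θ)·d. The sketch below illustrates this general geometric fact; it is not taken from the training code. Uniform sampling of the scale and angle is an assumption, the sign convention of the rotation depends on the image coordinate system, and the function names are hypothetical.

```python
import math
import random


def sample_augmentation(rng):
    """Sample a scale in [0.3, 3] and a rotation in [-17, 17] degrees
    (uniform sampling is an assumption)."""
    scale = rng.uniform(0.3, 3.0)
    angle_deg = rng.uniform(-17.0, 17.0)
    return scale, angle_deg


def transform_flow_vector(u, v, scale, angle_deg):
    """Flow vector after both frames are scaled by `scale` and rotated by
    `angle_deg` about the same point: d' = scale * R(angle) * d."""
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    return scale * (c * u - s * v), scale * (s * u + c * v)


rng = random.Random(0)
scale, angle = sample_augmentation(rng)
u_aug, v_aug = transform_flow_vector(3.0, 4.0, scale, angle)
```

The magnitude of the transformed vector equals the original magnitude times the scale, so a flow of 5 px under a scale of 2 becomes 10 px regardless of the rotation.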

Experiments
In this section, we first compare the SHOF, MHOF and other common optical flow datasets. Next, we show that fine-tuning SPyNet on SHOF improves the model, while we observe that fine-tuning PWC-Net on SHOF does not improve the model further. We then fine-tune the same methods on MHOF and evaluate them. We show that both SPyNet and PWC-Net improve when fine-tuned on MHOF. We show that the methods trained on the MHOF dataset outperform generic flow estimation methods for the pixels corresponding to humans. Finally, we show with qualitative results that both the models trained on SHOF and the models trained on MHOF seem to generalize to real world scenes.
Dataset Details. In comparison with other optical flow datasets, our dataset is larger by an order of magnitude (see Table 3); the SHOF dataset contains 135,153 training frames and 10,867 test frames with optical flow ground truth, while the MHOF dataset has 86,259 training, 13,236 test and 11,817 validation frames. For the single-person dataset we keep the resolution small at 256 × 256 px to facilitate easy deployment for training neural networks. This also speeds up the rendering process in Blender for generating large amounts of data. We show the comparisons of processing time of different models on the SHOF dataset in Table 4(a). For the MHOF dataset we increase the resolution to 640 × 640 px to be able to reason about optical flow even in small body parts like fingers, using SMPL+H. Our data is extensive, containing a wide variety of human shapes, poses, actions and virtual backgrounds to support deep learning systems.
Comparison on SHOF. We compare the average End Point Errors (EPEs) of optical flow methods on the SHOF dataset in Table 4, along with the time for evaluation. We show visual comparisons in Figure 5. Human motion is complex and general optical flow methods fail to capture it. Our trained network SPyNet+SHOF outperforms previous methods, and SPyNet [2] in particular.
We observe that FlowNet [10] shows poor generalization on our dataset. Since the results of FlowNet [10] in Tables 4 and 6 are very close to the zero flow (no motion) baseline, we cross-verify by evaluating FlowNet on a mixture of Flying Chairs [10] and Human Optical Flow and observe that the flow outputs on SHOF are quite random (see Figure 5). The main reason is that SHOF contains a significant amount of small motions and it is known that FlowNet does not perform very well on small motions. SPyNet [2], however, performs quite well and is able to generalize to body motions. The results, however, look noisy in many cases.
Our dataset employs a layered structure where a human is placed against a background. As such, layered methods like PCA-layers [31] perform very well on a few images (row 8 in Figure 5) where they are able to segment a person from the background. However, in most cases, they do not obtain good segmentation into layers.
Previous state-of-the-art methods like LDOF [58] and EpicFlow [34] perform much better than others. They get a good overall shape and smooth backgrounds. However, their estimation is quite blurred. They tend to miss the sharp edges that are typical of human hands and legs. They are also significantly slower.
In contrast, by fine-tuning on our dataset, the performance of SPyNet+SHOF improves by 40% over SPyNet on the SHOF dataset. We also find that fine-tuning PWC-Net on SHOF does not improve the model. Empirically, we have seen that PWC-Net already performs well for small motions, which are a significant portion of SHOF. This partially motivates the generation of the MHOF dataset, which includes larger motions and more complex scenes with occlusions.
A qualitative comparison to popular optical flow methods can be seen in Figure 5. The flow estimates of SPyNet+MHOF are sharper than those of generic methods, especially at edges. Furthermore, fine details such as the motion of hands are estimated more precisely.
Table 5). For these pixels, light-weight networks like SPyNet and PWC-Net trained on our dataset (SPyNet+MHOF and PWC+MHOF) improve over almost all generic optical flow estimation methods, including the much larger network FlowNet2. PWC+MHOF is the best performing method. A more fine-grained analysis of EPE across body parts is shown in Table 6. We obtain the EPE of these body parts using the segmentation shown in Figure 3. The improvements of PWC+MHOF over FlowNet2 are larger for body parts at the end of the kinematic tree (i.e., feet, calves, arms and in particular fingers). Differences are smaller for body parts close to the torso. One interpretation of these findings is that movements of the torso are easier to predict, while movements of body parts at the end of the kinematic tree are more complex and thus harder to estimate. In contrast, SPyNet+MHOF outperforms FlowNet2 on body parts close to the torso and does not learn to capture the more complex motions of limbs better than FlowNet2.
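The body-only and body-part specific errors discussed above can be sketched as a masked average of the per-pixel EPE. This is an illustrative sketch rather than the evaluation code used for Tables 5 and 6; the function name `body_part_epe`, the integer label map, and the choice of 0 as the background label are all assumptions.

```python
import numpy as np


def body_part_epe(flow_pred, flow_gt, part_labels, background_label=0):
    """Return (body_only_epe, {part_label: epe}) for one frame.

    part_labels is an (H, W) integer map; background_label marks non-body
    pixels. The label encoding is an assumption of this sketch.
    """
    per_pixel = np.linalg.norm(
        np.asarray(flow_pred, dtype=np.float64) - np.asarray(flow_gt, dtype=np.float64),
        axis=-1,
    )
    part_labels = np.asarray(part_labels)
    body_mask = part_labels != background_label
    # Background errors are masked out; NaN signals a frame without body pixels.
    body_epe = float(per_pixel[body_mask].mean()) if body_mask.any() else float("nan")
    # One average per body part that actually appears in this frame.
    per_part = {
        int(label): float(per_pixel[part_labels == label].mean())
        for label in np.unique(part_labels[body_mask])
    }
    return body_epe, per_part
```

Because small parts such as fingers cover few pixels, they contribute little to the body-only average, which is one reason a separate per-part breakdown is informative.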
The above observations strongly indicate that our Human Optical Flow datasets (SHOF and MHOF) can also improve the performance of other optical flow networks on human motion.
Real Scenes. We show a visual comparison of results on real-world scenes of people in motion. For visual comparisons of models trained on the SHOF dataset, we collect these scenes by cropping people from real-world videos as shown in Figure 7. We use DPM [60] for detecting people and compute bounding box regions in two frames using the ground truth of the MOT16 dataset [61]. The results for the SHOF dataset are shown in Figure 8. A comparison of methods on real images with multiple people can be seen in Figure 9.
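One way to obtain an aligned image pair from per-frame person boxes is to crop both frames with a single shared window. The sketch below uses the union of the two boxes for that window; this choice, the function name `crop_pair`, and the (left, top, width, height) box layout are assumptions of the sketch, not a documented step of our pipeline.

```python
import numpy as np


def crop_pair(frame1, frame2, box1, box2):
    """Crop both frames to the union of a person's boxes in the two frames.

    Boxes are (left, top, width, height) in pixels; both frames are assumed
    to have the same resolution.
    """
    height, width = frame1.shape[:2]
    left = max(0, int(np.floor(min(box1[0], box2[0]))))
    top = max(0, int(np.floor(min(box1[1], box2[1]))))
    right = min(width, int(np.ceil(max(box1[0] + box1[2], box2[0] + box2[2]))))
    bottom = min(height, int(np.ceil(max(box1[1] + box1[3], box2[1] + box2[3]))))
    if right <= left or bottom <= top:
        raise ValueError("boxes do not overlap the image")
    # The same window is used for both frames so pixel coordinates stay aligned.
    return frame1[top:bottom, left:right], frame2[top:bottom, left:right]
```

Sharing one window keeps the person's displacement between the two crops identical to the displacement in the full frames, which would not hold if each frame were cropped with its own box.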
The performance of PCA-Layers [31] is highly dependent on its ability to segment. Hence, we see only a few cases where it looks visually correct. SPyNet [2] gets the overall shape, but the results look noisy in certain image parts. While LDOF [58], EpicFlow [34] and FlowFields [59] generally perform well, they often find it difficult to resolve the legs, hands and head of the person. The results from models trained on our Human Optical Flow dataset look appealing, especially in resolving the overall human shape and parts such as the legs, hands and head. Models trained on the Human Optical Flow dataset perform well under occlusion (Figure 8, Figure 9). Many examples including severe occlusion can be seen in Figure 9. In addition, Figure 9 shows that the models trained on MHOF are able to distinguish the motions of multiple people and predict sharp edges of humans.

Conclusion and Future Work
In summary, we created an extensive Human Optical Flow dataset containing images of realistic human shapes in motion together with ground truth optical flow. The dataset is comprised of two parts, the Single-Human Optical Flow (SHOF) and the Multi-Human Optical Flow (MHOF) dataset. We then train two compact network architectures based on spatial pyramids, namely SPyNet and PWC-Net. The realism and extent of our dataset, together with an end-to-end training scheme, allow these networks to outperform previous state-of-the-art optical flow methods on our new human-specific dataset. This indicates that our dataset can be beneficial for other optical flow network architectures as well. Furthermore, our qualitative results suggest that the networks trained on the Human Optical Flow dataset generalize well to real-world scenes with humans. The trained models are compact and run in real time, making them highly suitable for phones and embedded devices.
In future work, we plan to add 3D clothing and accessories to humans in the scene. The dataset and our focus on human optical flow open up a number of research directions in human motion understanding and optical flow computation. We would like to extend our dataset by modeling more diverse clothing and outdoor scenarios. A direction of potentially high impact for this work is to integrate it in end-to-end systems for action recognition, which typically take precomputed optical flow as input. The real-time nature of the method could support motion-based interfaces, potentially even on devices like cell phones with limited computing power. The dataset, dataset generation code, pretrained models, and training code are available, enabling researchers to use them for problems involving human motion.
Fig. 1 (a) We simulate human motion in virtual worlds, creating an extensive dataset with images (top row) and flow fields (bottom row); color coding from [1]. (b) We train SPyNet [2] and PWC-Net [3] for human motion estimation and show that they perform better when trained on our dataset and (c) can generalize to human motions in real-world scenes. Columns show single-person and multi-person cases alternately.

Fig. 3 Body part segmentation for the SMPL+H model. Symmetrical body parts are labeled only once. Finger joints follow the same naming convention as shown for the thumb. (Best viewed in color)

Fig. 4 Spatial Pyramid Network [2] (left) and PWC-Net [3] (right) for optical flow estimation. At each pyramid level, network G_k predicts flow at that level, which is used to condition the optical flow at the higher resolution level in the pyramid. Adapted from [3].

Fig. 7 We use the DPM [60] person detector to crop out people from real-world scenes (left) and use SPyNet+SHOF to compute optical flow on the cropped section (right).

Table 1
Comparison of datasets and the most important data preprocessing steps used to generate the SHOF and MHOF datasets. A short description of the respective part is provided in the last column.

Table 2
Comparison of tools, Blender passes and parameters used to generate the SHOF and MHOF datasets. The last column provides a short description of the respective method.

Table 3
Comparison of the Human Optical Flow datasets, namely the Single-Human Optical Flow (SHOF) and the Multi-Human Optical Flow (MHOF) dataset, with previous optical flow datasets.

Table 4
EPE comparisons and evaluation times of different optical flow methods on the SHOF dataset. Zero refers to the EPE obtained when zero flow (no motion) is always predicted. Evaluation times are based on the SHOF dataset with 256 × 256 image resolution. We time all GPU-based methods using a Tesla V100-16GB GPU.

Table 5
Comparison using End Point Error (EPE) on the Multi-Human Optical Flow (MHOF) dataset. We show the average EPE and body-only EPE. The body-only EPE is computed only over segments of the image depicting a human body. Best results are shown in boldface. A comparison of body-part specific EPE can be found in Table

Table 6
Comparison using End Point Error (EPE) on the Multi-Human Optical Flow (MHOF) dataset. We show the average EPE and body-part specific EPE, where part labels follow Figure 3. The first two rows are repeated from Table 5.