1 Introduction

Although challenging, marker-less real-time 3D human pose estimation is attracting increasing research interest as it promises step changes to a wide range of fields, including biomechanics, psychology, animation, human-computer interaction and computer vision. The goal is to regress a 3D, location-based limb skeleton of a human in a range of environments, as shown in Fig. 1. However, 3D pose estimation suffers from a large number of challenges, including large variation in appearance, arbitrary viewpoints and limited visibility due to occlusion by external entities and self-occlusion. To resolve these challenges effectively, marker-based systems such as Vicon (http://www.vicon.com) or OptiTrack (http://www.optitrack.com) are commonly used to provide sufficient joint accuracy.

However, the requirement to wear a special suit or a large number of physical markers is intrusive and restricts both the performance environment and the range of motions the subject can perform. Furthermore, heavy occlusion from other actors or props in the scene, or adverse illumination, can cause these approaches to fail in practical deployments. Therefore, approaches have tried to remove these constraints through the use of elaborate prior terms and body modelling (von Marcard et al. 2017), the use of depth cameras (Yub et al. 2016), or by extending 2D estimation to 3D (Tome et al. 2017; Tan et al. 2017).

Nevertheless, such systems based purely upon computer vision suffer from inaccuracies or are restricted by complex priors. We propose a compromise: the fusion of vision-based 3D pose estimation with Inertial Measurement Units (IMUs) (Roetenberg et al. 2009, http://www.neuronmocap.com) to estimate pose accurately. IMUs are small sensor units placed on key body parts; they do not suffer from illumination or occlusion failures, but they do suffer from drift and therefore cannot provide a full solution without the visual component. Given the complementary nature of the two modalities, we fuse vision and IMU to estimate the 3D joint skeleton of human subjects. We show that by incorporating both cues we can mitigate the drift and lack of spatial positional information in the IMU data and the requirement for learnt, complex human models in the vision. The complementary modalities mutually reinforce one another during inference: rotational and occlusion ambiguities are mitigated by the IMUs, while global positional drift is reduced and context provided by the vision.

Fig. 1

Our approach regresses 3D estimates for varied pose, subjects and environment

Fig. 2

Our two-stream network fuses IMU data with volumetric (PVH) data derived from multiple viewpoint video (MVV) to learn an embedding for 3D joint locations (human pose)

Our proposed solution combines foreground occupancy mattes and semantic 2D pose estimates from a number of wide-baseline video cameras to form a multi-channel probabilistic visual hull (PVH) (Grauman et al. 2003). A coarse discretisation of the 3D space around the performer is then used to train a 3D convolutional network to predict 3D joint estimates from the volumetric PVH data. Frame-wise temporal consistency of the 3D pose estimates is learnt with a variant of a Recurrent Neural Network (RNN) using LSTM layers; the LSTM learns a predictive model given a small number of previous frames. Concurrently, IMUs are used to solve a simple kinematic model to provide a further 3D joint estimate, and both estimates are then fused in an additional dense neural layer. The two data modes are illustrated in Fig. 2.

It is well known that training deep networks from scratch requires a large amount of data, and this requirement is heightened by the use of 3D convolutional layers in our work. Moreover, no single dataset is available containing IMU and MVV data with a high-quality ground truth. Therefore we release a multi-subject, multi-action dataset as a further contribution of this work. The initial solution of this work was presented at BMVC 2017 (Gilbert et al. 2017). In this paper, we make several additional contributions. First, we enhance our initial 3D convolutional network for pose estimation by incorporating semantic pose information encoded in additional channels within the volumetric data. We show that this information delivers a significant step-up in performance, resulting in improved state-of-the-art performance on both the public TotalCapture and Human3.6M datasets. In addition to a deeper analysis of these networks, we also introduce a novel dataset, TotalCaptureOutdoor (Malleson et al. 2017), upon which we evaluate our system. The additional analysis within the experimental section (Sect. 4.5) gives greater insight into the contribution of the individual components, while the methodology is expanded, giving the reader further insight into our implementation.

2 Related Work

Human pose estimation can be split into two broad categories: top-down approaches that fit an articulated limb kinematic model to the source data, and data-driven bottom-up approaches.

Top-down approaches to 2D pose estimation fit an articulated limb model to data, incorporating kinematics into the optimisation to bias toward plausible configurations. Lan and Huttenlocher (2005) provide a top-down, model-based approach, considering the conditional independence of parts; however, inter-limb dependencies (e.g. symmetry) are not considered. A more global treatment is proposed in Jiang (2009) using linear relaxation, but it performs well only on uncluttered scenes. The fusion of pictorial structures with AdaBoost shape classification was explored in Andriluka et al. (2009). Agarwal and Triggs used non-linear regression to estimate pose from 2D silhouette images (Agarwal et al. 2004). The SMPL model (Loper et al. 2015) provides a rich statistical body model that can be fitted to incomplete data, and von Marcard et al. (2017) incorporated IMU measurements with it to provide pose estimation without visual data. Tan et al. (2017) employ the SMPL model to estimate 3D pose from 2D images in an encoder/decoder framework, and Huang et al. (2017) combine the SMPL body model with 2D joint estimates to reinforce and improve the 3D pose.

Bottom-up pose estimation is driven by image parsing to isolate components. Srinivasan and Shi (2007) used graph-cuts to parse a subset of salient shapes from an image and group these into a model of a person. Ren et al. (2005) recursively split Canny edge contours into segments, classifying each as a putative body part using cues such as parallelism. Ren and Collomosse (2012) also used a Bag of Visual Words for implicit pose estimation as part of a pose-similarity system for dance video retrieval. More recently, studies have begun to leverage the power of convolutional neural networks, following in the wake of the eye-opening results of Krizhevsky et al. (2012) on image recognition. Toshev and Szegedy (2014), in the DeepPose system, used a cascade of convolutional neural networks to estimate 2D pose in images. Descriptors learned by a CNN have also been used for 2D pose estimation from very low resolution images (Park and Ramanan 2015). Elhayek et al. (2015) used MVV with a convnet to produce 2D pose estimates, while Rhodin et al. (2016) minimised an edge energy inspired by volume ray casting to deduce the 3D pose. More recently, given the success and accuracy of 2D joint estimation (Cao et al. 2016), an increasing number of works transfer those predictions into 3D using a post-processing optimisation step. Sanzari et al. (2016) estimate the location of 2D joints before predicting 3D pose using the appearance and probable 3D pose of the discovered parts with a hierarchical Bayesian model, while Zhou et al. (2016) integrate 2D, 3D and temporal information to account for uncertainties in the data. The challenge of estimating 3D human pose from MVV is currently less explored, although 3D pose estimation is generally cast as a coordinate regression task, with the target output being the spatial xyz coordinates of each joint with respect to a known root node such as the pelvis. Trumble et al. (2016) used a flattened MVV-based spherical histogram with a 2D convnet to estimate pose, while Pavlakos et al. (2017a) used a simple volumetric representation in a 3D convnet for pose estimation, and Wei et al. (2016) performed related work in aligning pairs of joints to estimate 3D human pose. Differently, Huang et al. (2015) constructed a 4D mesh of the subject from video reconstruction to estimate the 3D pose, while Tekin et al. (2016a) included a pretrained autoencoder within the network to enforce structural constraints.

Another challenge of MVV is the labelling of training data; therefore Rogez and Schmid (2016) artificially augment a dataset of real images with 2D human pose annotations using 3D motion capture data. Given a candidate 3D pose, their algorithm selects for each joint an image whose 2D pose locally matches the projected 3D pose. Similarly, Lassner et al. (2017) use the SMPL (Loper et al. 2015) body model to generate training data without motion capture.

Fig. 3

Network architecture comprising two streams: a 3D Convnet for MVV pose embedding, and a kinematic solve from IMUs. Both streams pass through LSTM layers before the fusion of the concatenated estimates in a further FC layer

To predict temporal sequences, RNNs and their variants, including LSTMs (Hochreiter and Schmidhuber 1997) and Gated Recurrent Units (Chung et al. 2014), have recently been shown to learn and generalise the properties of temporal sequences successfully. Graves (2013) was able to predict isolated handwriting sequences, while in natural language processing (NLP) Graves and Jaitly (2014) combined an LSTM model with a Connectionist Temporal Classification objective function, directly transcribing audio data to text. Alahi et al. (2016) were also able to predict human trajectories in crowds by modelling each human with an LSTM and jointly predicting the paths.

In the field of IMUs, a number of works have used IMUs to estimate pose. Roetenberg et al. (2009) used 17 IMUs with 3D accelerometers, gyroscopes and magnetometers, fused with a Kalman filter, to define the pose of a subject. Slyper and Hodgins (2008) reconstruct pose using 5 accelerometers, retrieving pre-recorded poses with similar accelerations via a lookup process from a database. Acceleration data is, however, very noisy, and the space of possible accelerations is under-constrained, making the learning a very difficult task. Schwarz et al. (2009) directly regress full pose using only 4 IMUs with Gaussian Process regression, with good results when the test motions are present in the database. Similarly, Pons-Moll et al. (2011) use a particle filter framework to optimise the orientation constrained by IMU samples taken from a manifold of poses, to solve for outdoor sequences. Liu et al. (2011) also regress a full pose by querying a database of online local models based on the response of 6 IMUs.

The initial work to fuse IMU and video was by Pons-Moll et al. (2010), combining limb orientations from the inertial sensors with stable, drift-free position information from video data. Marcard et al. (2016) fused video and IMU data to improve and stabilise full-body motion capture. Helten et al. (2013) used a single depth camera with IMUs to track the entire body, with the IMUs identifying similar candidate poses and the depth data being used to obtain the full-body estimate. Andrews et al. (2016) used a sparse set of labelled optical markers, IMUs, and a motion prior in an inverse dynamics formulation, while Malleson et al. (2017) used IMUs with a full kinematic solve to effectively estimate 3D pose indoors and outdoors.

3 Methodology

An overview of the approach is shown in Fig. 3. A 3D volumetric geometric proxy of the performer is formed from 2D foreground occupancy and 2D semantic heat maps as a multi-channel probabilistic visual hull. This coarse visual hull is fed into a 3D convnet that directly regresses an embedding encoding the 3D spatial joint locations of the performer's body. A temporal model, a recurrent neural network, is trained on the embedding to enforce temporal consistency on the 3D pose detections. Uniquely for this work, IMU data from key body parts is used for a forward kinematic solve of the pose, which is smoothed with a learnt temporal RNN model. Given the complementary nature of the two data modes, a dense layer fuses both to provide a joint-based embedding of the joint locations.

3.1 Volumetric Pose Embedding

Figure 3 shows a diagram of our architecture; it is based on a deep, multi-layer neural network consisting of successive 3D convolutional and pooling layers. The goal of CNN pose regression is to obtain the 3D Cartesian coordinates of J joints given the multi-channel 3D probabilistic visual hull volume. The target of the network is a \(3*J\)-dimensional vector comprising the concatenated xyz coordinates of the J joints of the human body; for our work \(J=17\), resulting in a 51-dimensional final-layer embedding (\(3*17\)).

The detailed filter parameters are listed in Table 1 for each layer in Fig. 3. By using 3D convolution filters, we are able to encode information from all cameras as a volume simultaneously. In training, the network is supervised with an L2 regression loss:

$$\begin{aligned} \mathcal {L} = \sum ^J_{j=1} \Vert p_{gt}^j - p_{pr}^j \Vert ^2_2. \end{aligned}$$
(1)

where \(p_{gt}^j\) is the ground-truth location of joint j and \(p_{pr}^j\) is the predicted location of joint j. The location of each joint is expressed globally, normalised to a root joint at the pelvis. To further encourage pose invariance with respect to the facing direction of the performer, the training data is augmented by applying a random rotation about the central vertical axis, \(\theta =[0,2\pi ]\).
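
For illustration, a minimal NumPy sketch of the Eq. 1 loss and the vertical-axis rotation augmentation is given below. The array shapes, the choice of z as the vertical axis and the function names are our own assumptions for illustration, not the exact training code used for the network of Table 1.

```python
import numpy as np

def l2_joint_loss(p_gt, p_pr):
    """Training loss of Eq. 1: sum over joints of the squared Euclidean
    error between ground-truth and predicted root-relative positions.
    p_gt, p_pr: (J, 3) arrays, here J = 17."""
    return float(np.sum((p_gt - p_pr) ** 2))

def rotate_joints_about_vertical(joints, theta=None):
    """Rotation augmentation: apply a random rotation theta in [0, 2*pi)
    about the central vertical (z) axis to the root-relative joint targets.
    The matching rotation is applied to the voxel grid when the PVH is
    generated. joints: (J, 3) xyz coordinates."""
    if theta is None:
        theta = np.random.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot_z = np.array([[c, -s, 0.0],
                      [s,  c, 0.0],
                      [0.0, 0.0, 1.0]])
    return joints @ rot_z.T
```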

Table 1 Parameters of the 3D Convnet used to infer the MVV pose embedding

3.2 Visual Channels

Two visual channels are employed: a 2D occupancy matte and semantic 2D joint labels. The occupancy channel is a soft probability of foreground occupancy formed by comparing the current frame I with a clean plate P captured before the recorded sequence; the thresholded L2 distance between the two images in the HSV colour domain provides the soft occupancy probability for the first channel. The second, semantic, channel consists of human joint belief labels estimated by OpenPose (Wei et al. 2016; Cao et al. 2017), a multi-stage process that iteratively refines 2D estimates of joint positions using a mixture of knowledge of the image and the joint-location estimates of the previous stage. At each stage s and for each joint label j the algorithm returns a dense per-pixel belief map \(m^{j}_{s}\), which provides the confidence of a joint centre at any given pixel (x, y) for stage s. Much of the algorithm's power comes from the fact that in stages \(s \in \{2, \dots , S\}\) the belief maps are a function not just of the information contained in the image but also of the information computed by the previous stage. For this work we transform these per-joint belief maps into a single label image M by maximising over the confidence of all possible joint labels on a per-pixel basis.

$$\begin{aligned} M(x,y) = \mathop {{{\mathrm{arg\,max}}}}\limits _{j} m_{S}^{j}(x,y) \end{aligned}$$
(2)

Figure 4 shows the soft occupancy matte and joint labels for an example image.
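
The following sketch shows how Eq. 2 collapses the final-stage belief maps into the semantic label channel; the array layout and the returned per-pixel confidence are illustrative assumptions.

```python
import numpy as np

def joint_label_image(belief_maps):
    """Eq. 2: collapse per-joint belief maps from the final OpenPose stage
    into a single semantic label image M.
    belief_maps: (J, H, W) array, belief_maps[j, y, x] = confidence that
                 pixel (x, y) is the centre of joint j.
    Returns the label image (H, W) and the associated per-pixel confidence."""
    labels = np.argmax(belief_maps, axis=0)       # winning joint index per pixel
    confidence = np.max(belief_maps, axis=0)      # its belief value
    return labels, confidence
```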

Fig. 4

An example of the foreground occupancy and 2D joint label belief map (white indicates high probability of occupancy)

3.3 Volumetric Representation of Proxy

Many recent approaches use multiple 2D views (Pavlakos et al. 2017b) or infer 3D from a learnt 2D lookup (Tome et al. 2017; Chen and Ramanan 2017). However, we propose to use multiple 2D views simultaneously to produce a crude but accurate 3D representation of the human body. Integrating the multiple views into a 3D shape overcomes the unavoidable ambiguities and occlusions present in individual 2D images. However, the cost is the exponential increase in dimensionality over 2D, and also the lack of a pre-trained ImageNet-based model (Krizhevsky et al. 2012). Therefore, to keep training tractable while still providing an increase in detail over 2D, we propose to use a multi-channel probabilistic visual hull (PVH) (Grauman et al. 2003) to infer the 3D occupancy shape from multiple camera views. A PVH quantises the volume occupancy in a soft probabilistic computation that greatly reduces the dimensionality while maintaining the detail. The volumetric representation is agnostic to the source of the data, and for this work we use both 2D foreground occupancy mattes and semantic 2D joint labels. Both are noisy and contain failure cases in a single view; however, the probabilistic nature of the PVH ensures that noise is suppressed and only a consistent signal is propagated to the 3D volume.

Given a set of C wide-baseline cameras, \(c=\left[ 1, \dots , C\right] \), where \(C>3\), surrounding a performance volume and calibrated with a known orientation \(\mathbf {R}_c\), centre of projection \({COP}_c\), focal length \(f_c\) and optical centre \((o^x_c, o^y_c)\), the camera parameters for a given camera c are

$$\begin{aligned} \{\mathbf {R}_c, {COP}_c, f_c, o^x_c, o^y_c\} \end{aligned}$$
(3)

The 3D capture volume is finely decimated into voxels \(v=\left[ 1, \dots , V\right] \), approximately \(10\,\mathrm {mm}^3\) in size. Then, given a 2D image denoted \(I_c\) with \(\Phi =\left[ 1, \dots , \phi \right] \) channels, the voxel occupancy from a given camera view c is defined as the probability:

$$\begin{aligned} p(v_i \mid c) = I_c(x[v_i],y[v_i],\phi ) \end{aligned}$$
(4)

where the voxel \(v_i\) projects to the 2D image coordinate position \((x[v_i], y[v_i])\) via:

$$\begin{aligned}&x[v_i]=\frac{f_c v^x_i}{v^z_i}+o^x_c~~~\mathrm {and}~~~y[v_i]=\frac{f_c v^y_i}{v^z_i}+o^y_c, \end{aligned}$$
(5)
$$\begin{aligned}&\mathrm {where}~~~ \left[ \begin{array}{ccc} x &{}y &{}z\\ \end{array}\right] = {COP}_c + R_c^{-1} v_i. \end{aligned}$$
(6)

where \(\left[ \begin{array}{ccc}x&y&z \end{array}\right] \) is the 3D real-world global coordinate location. The overall probability of occupancy for a given voxel and channel, \(p(v_i,\phi )\), is then the product over all camera views:

$$\begin{aligned} p(v_i,\phi ) = \prod _{c=1}^{C} p(v_i \mid c), \end{aligned}$$
(7)

this is then computed for all voxels in the volume

$$\begin{aligned} \sum _{i \in V} \sum _{j \in \Phi } p(v_i,\phi _j) \end{aligned}$$
(8)

The fine-grained voxel occupancy approximation is then down-sampled via a weighted Gaussian filter to the coarse input shape and size of the first layer in the convnet, 30 \(\times \) 30 \(\times \) 30. This contains roughly the same number of elements as a \(150 \times 150\) 2D image, with each voxel approximating a 67 \(\times \) 67 \(\times \) 67 mm volume in the real world.
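
The sketch below illustrates the PVH construction of Eqs. 4-7 for a single channel: each voxel centre is projected into every calibrated camera, the channel value is sampled, and the per-view probabilities are multiplied. The camera dictionary layout, the world-to-camera convention and the pinhole projection used here are illustrative assumptions consistent with Eqs. 5-6, not the exact implementation.

```python
import numpy as np

def build_pvh_channel(images, cameras, voxel_centres):
    """Single-channel PVH (Eqs. 4-7).
    images:        list of C arrays (H, W), values in [0, 1] (e.g. soft matte
                   or per-pixel joint-label confidence).
    cameras:       list of C dicts with 'R' (3x3 world-to-camera rotation),
                   'COP' (3,) camera centre in world coords, 'f' focal length
                   in pixels, 'ox', 'oy' principal point.
    voxel_centres: (V, 3) world-space voxel centre positions.
    Returns (V,) occupancy probabilities, the product over all views."""
    V = voxel_centres.shape[0]
    p = np.ones(V)
    for img, cam in zip(images, cameras):
        # World -> camera coordinates (pinhole model assumed).
        vc = (cam['R'] @ (voxel_centres - cam['COP']).T).T       # (V, 3)
        z = np.clip(vc[:, 2], 1e-6, None)
        x = (cam['f'] * vc[:, 0] / z + cam['ox']).astype(int)    # Eq. 5
        y = (cam['f'] * vc[:, 1] / z + cam['oy']).astype(int)
        h, w = img.shape
        inside = (x >= 0) & (x < w) & (y >= 0) & (y < h) & (vc[:, 2] > 0)
        p_view = np.zeros(V)
        p_view[inside] = img[y[inside], x[inside]]               # Eq. 4
        p *= p_view                                              # Eq. 7
    return p
```

The resulting fine-resolution volume would then be Gaussian-filtered and down-sampled to the 30 \(\times \) 30 \(\times \) 30 network input described above.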

3.4 Inertial Pose Estimation

To estimate the pose from joint orientations, Xsens IMUs (Roetenberg et al. 2009) are placed on key body parts. The end rigid joints provide the most discriminative data and effectively constrain the pose parameters when fused later with the vision. The pose optimisation of Malleson et al. (2017) is used, which aims to minimise the following energy:

$$\begin{aligned} E(\theta ) = \overbrace{E_{R}(\theta )+E_A(\theta )}^{Data} + \overbrace{E_{PP}(\theta )+E_{PD}(\theta )}^{Prior} \end{aligned}$$
(9)

where \(E_R(\theta )\) and \(E_A(\theta )\) are orientation and acceleration constraints, respectively, and \(E_{PP}(\theta )\) and \(E_{PD}(\theta )\) are the pose projection and pose deviation priors, respectively.

For each IMU, \(k \in [1, 13]\), we assume rigid attachment to a bone b and calibrate the relative orientation, \(\mathbf {R}^k_{kb}\), between the IMU k and the bone. The reference frame of each IMU, \(\mathbf {R}^k_{kw}\), is also calibrated approximately against the global world coordinates w. Each local IMU orientation measurement, \(\mathbf {R}^k_{m}\), is transformed to a global bone orientation, \(\mathbf {R}^k_{b}\), as follows:

$$\begin{aligned} \mathbf {R}^k_{b} = (\mathbf {R}^k_{kb})^{-1} \mathbf {R}^k_{kw} \mathbf {R}^k_{m} \end{aligned}$$
(10)

Then the local (hierarchical) joint rotation, \(\mathbf {R}^k_h\), for a given bone b in the skeleton is inferred by the kinematic chain:

$$\begin{aligned} \mathbf {R}^k_{h} = \mathbf {R}^k_{b} (\mathbf {R}^{par(b)}_{b})^{-1} \end{aligned}$$
(11)

where par(b) is the parent of bone b. The forward kinematics begins at the root and proceeds down the joint tree (with unmeasured bones kept fixed).
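
As an illustration of this kinematic chain, the sketch below accumulates hierarchical bone rotations from the root down the skeleton to recover global joint positions (Eq. 11 inverted into a forward pass). The skeleton representation (offsets, parent indices) and function names are our own assumptions.

```python
import numpy as np

def forward_kinematics(local_rotations, offsets, parents, root_position):
    """Recover global joint positions from hierarchical joint rotations.
    local_rotations: (B, 3, 3) local rotation per bone (identity for bones
                     with no IMU measurement, which remain fixed).
    offsets:         (B, 3) bone offset from its parent in the rest pose.
    parents:         list of parent indices, parents[0] == -1 for the root.
    root_position:   (3,) world position of the root (pelvis)."""
    B = len(parents)
    global_R = [None] * B
    positions = np.zeros((B, 3))
    for b in range(B):                      # assumes parents precede children
        if parents[b] < 0:
            global_R[b] = local_rotations[b]
            positions[b] = root_position
        else:
            p = parents[b]
            global_R[b] = global_R[p] @ local_rotations[b]
            positions[b] = positions[p] + global_R[p] @ offsets[b]
    return positions
```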

In addition to orientation, the IMUs provide local acceleration measurements, and a window of three frames is used: the current frame t and the previous two frames \(t-1\) and \(t-2\). For each IMU, a constraint is added that seeks to minimise the difference between the measured and solved acceleration of the tracked target site. The solved acceleration is computed using central finite differences on the solved poses of the previous two frames along with the current frame being solved. The local accelerations from the previous frames of IMU data are converted to global coordinates in a similar manner to Eq. 10, with gravity removed.
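
A minimal sketch of the central finite-difference acceleration used by this constraint is given below, assuming a fixed frame interval dt; the variable names are ours.

```python
def solved_acceleration(p_t, p_t1, p_t2, dt):
    """Central finite-difference acceleration at frame t-1 from the solved
    positions of frames t, t-1 and t-2 (each a (3,) array), with frame
    interval dt in seconds."""
    return (p_t - 2.0 * p_t1 + p_t2) / (dt * dt)
```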

We use two priors based on a PCA of the pose: PCA projection (\(E_{PP}\)) and PCA deviation (\(E_{PD}\)). The projection prior encourages the solved body pose to lie close to the reduced-dimensionality subspace of prior poses (a soft reduction in the degrees of freedom of the joints), while the deviation prior discourages deviation from the prior observed pose variation (soft joint rotation limits). Together these terms produce soft constraints that yield plausible motion while not strictly enforcing a reduced dimensionality on the solved pose, thus allowing novel motion to be more faithfully reproduced at run time. For full details of the cost functions used please see Malleson et al. (2017).

These joint orientations, in conjunction with the calibrated performer's skeleton, allow joint locations to be inferred and concatenated into a joint vector \(\mathbf {J}_i\). For a more detailed description of relating inertial data to other sensor and model coordinate systems, see the work by Baak et al. (2010). To temporally align the IMU and video data, an initial foot stamp was performed by the subject, which was visible in the video and produces a strong peak in acceleration in the IMU data. The inertial reference frame of each IMU, \(\mathbf {R}^k_{kw}\), is assumed to be consistent between IMUs and in alignment with the world coordinates through the global up direction and magnetic north. The IMU-bone positions \(t_{kb}\) are specified by manual visual alignment, and the IMU-bone orientations \(\mathbf {R}^k_{kb}\) are calibrated using the measured orientations with the subject in a known pose (the T-pose, facing the direction of a given axis).

3.5 Learnt Temporal Consistency

Given the temporal nature of human pose sequences, it is desirable to learn and enforce temporal consistency on the two streams of per frame pose estimation. Thus allowing the rich temporal motion patterns between frames and joints to be effectively incorporated into the 3D pose prediction. Long Short Term Memory (LSTM) layers (Hochreiter and Schmidhuber 1997) have provided excellent performance in exploiting longer term temporal correlations compared to standard recurrent neural networks on many tasks, e.g. speech recognition (Sak et al. 2014) and video description (Donahue et al. 2015). LSTM layers can store and access information over long periods of time but mitigate the vanishing gradient problem common in RNNs through a specialised gating mechanism.

Given an input vector \(\mathbf {J}_i(t)\) at time t consisting of concatenated joint spatial coordinates, the aim is to learn the function that minimises the loss between the input vector and the resulting output joint vector \(\mathbf {J}_o = o_t \circ \tanh (c_t)\) (\(\circ \) denotes the Hadamard product), where \(o_t\) is the output gate and \(c_t\) is the memory cell, a combination of the previous memory \(c_{t-1}\) multiplied by a forget gate and the newly computed state multiplied by the input gate, as shown in Fig. 5. Intuitively, it is a combination of the previous memory and the new input: the old memory could be completely ignored (forget gate all 0s) or the newly computed state ignored completely (input gate all 0s), but in practice the behaviour lies between these two extremes. The memory cell \(c_t\) is given in Eq. 12.

$$\begin{aligned} c_{t} = f_{t} \circ c_{t-1} + i_t \circ \tanh (W_{g}\mathbf {J}_{i}(t) + U_{g}\mathbf {J}_i(t-1)) \end{aligned}$$
(12)
Fig. 5

The design and connections of an LSTM layer.

Within each gate there are two weight matrices to be learnt, W and U. The input gate \(i_t\) defines the extent to which the newly computed state for the current input \(\mathbf {J}_i(t)\) is kept in the memory,

$$\begin{aligned} i_t = W_i \mathbf {J}_{i}(t) + U_i \mathbf {J}_i(t-1) \end{aligned}$$
(13)

A forget gate \(f_t\) defines how much of the previous state remains in memory,

$$\begin{aligned} f_t = W_f \mathbf {J}_{i}(t) + U_f \mathbf {J}_i(t-1) \end{aligned}$$
(14)

and an output gate \(o_t \) defines how much of the internal state is exposed to the external network (higher layers and the next time step).

$$\begin{aligned} o_t = W_o \mathbf {J}_{i}(t) + U_o \mathbf {J}_i(t-1) \end{aligned}$$
(15)

The weights are learnt using back-propagation, employing the loss function from Eq. 1. Each data modality has its own distinct LSTM stream, using the previous f frames to predict the current-frame joint vector for both the visual and the IMU pose estimates. Two LSTM layers are used, each with 1024 memory cells, a look-back of \(f=5\) and a learning rate of \(10^{-3}\), trained with RMSprop (Dauphin et al. 2015).
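
A minimal Keras-style sketch of one such temporal stream is shown below, assuming the two 1024-unit LSTM layers are stacked and followed by a linear regression layer; the stacking order, the final dense layer and the use of 'mse' in place of Eq. 1 are our assumptions.

```python
import tensorflow as tf

J = 17                    # joints; each frame is a 3*J = 51-vector
LOOK_BACK = 5             # previous frames used to predict the current frame

# One such model per modality (vision PVH stream and solved-IMU stream).
temporal_model = tf.keras.Sequential([
    tf.keras.layers.LSTM(1024, return_sequences=True,
                         input_shape=(LOOK_BACK, 3 * J)),
    tf.keras.layers.LSTM(1024),
    tf.keras.layers.Dense(3 * J),   # regressed joint vector for the current frame
])

# 'mse' stands in for the L2 loss of Eq. 1 (identical up to a constant scale).
temporal_model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-3),
    loss='mse')
```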

3.6 Modality Fusion

The vision and IMU sensors each independently provide a per-joint 3D coordinate estimate of the performer's pose. It therefore makes sense to incorporate both modes into the final estimate, given their complementary nature. Naively, an average pooling of the two joint estimates could be used; this would be fast and efficient assuming both modalities have small errors. However, significant errors will often be present in one of the modes due to their different measurement approaches. We therefore fuse the two modes with a further fully connected layer, which can combine both measurements in a more meaningful way than simply taking the average. This allows errors in the pose from the vision and the IMU to be identified and corrected by the combined, fused model. The fully connected dense layer consists of 64 units and was trained with an RMSprop optimiser (Dauphin et al. 2015) with a learning rate of \(10^{-4}\) to refine the prediction. All stages of the model are implemented using TensorFlow.
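
The sketch below illustrates one plausible form of this fusion layer, assuming the two 51-dimensional temporally smoothed joint vectors are concatenated before the 64-unit dense layer and a final linear layer regresses the fused pose; the concatenation, the ReLU activation and the output layer are our assumptions.

```python
import tensorflow as tf

J = 17
vision_pose = tf.keras.Input(shape=(3 * J,), name='vision_lstm_output')
imu_pose = tf.keras.Input(shape=(3 * J,), name='imu_lstm_output')

merged = tf.keras.layers.Concatenate()([vision_pose, imu_pose])
hidden = tf.keras.layers.Dense(64, activation='relu')(merged)   # 64-unit fusion layer
fused_pose = tf.keras.layers.Dense(3 * J)(hidden)               # fused joint vector

fusion_model = tf.keras.Model([vision_pose, imu_pose], fused_pose)
fusion_model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
    loss='mse')                                                  # stands in for Eq. 1
```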

4 Evaluation

To evaluate our approach we employ three different datasets. First, we present results for the multi-channel vision stream only (without IMU fusion) on the Human3.6M dataset (Ionescu et al. 2014) in Sect. 4.1. We then introduce our new dataset, TotalCapture (Gilbert et al. 2017), in Sect. 4.2, which contains both video and IMU data with the associated ground-truth joint skeleton. We evaluate our full fused vision and IMU approach on the TotalCapture dataset, and we also perform an ablation study in Sect. 4.4 to examine the individual contributions of our work. Finally, we evaluate the ability of our approach to generalise to new sequences on the challenging TotalCaptureOutdoor dataset (Malleson et al. 2017) in Sect. 5, a collection of MVV and IMU sequences captured in an uncontrolled outdoor environment.

Fig. 6

Example pose estimates from the Human 3.6M dataset from two viewpoints

4.1 Human 3.6M

We evaluate 3D pose estimation on the Human 3.6M dataset (Ionescu et al. 2014), where 3D ground-truth key points are available from a marker-based motion capture system. It consists of 3.6 million video frames captured from four camera viewpoints in a 360-degree arrangement. There are five female and six male subjects, performing typical activities such as posing, sitting and giving directions. There is no IMU data within the dataset, so we evaluate only the visual component, the PVH + LSTM; this corresponds to the upper red and green layers of Fig. 3 without the fusion of the IMU kinematic solve. To allow comparison to other approaches we follow the same data partition protocol as in previous works (Ionescu et al. 2014; Li et al. 2015; Tekin et al. 2016, 2016a; Tome et al. 2017; Gilbert et al. 2017): the training data consists of subjects S1, S5, S6, S7 and S8, and testing is on the unseen subjects S9 and S11. The standard 3D Euclidean error metric is used to evaluate accuracy; it is the Euclidean error averaged over all frames and the 17 joints (in Human 3.6M), in millimetres (mm). The results of our multi-channel 3D volumetric approach with temporal consistency are evaluated qualitatively in Fig. 6 and quantitatively in Table 2. In particular, we compare to the approach of Lin et al. (2017), who use 2D joint estimates with a 3D recurrent network, and to Tome et al. (2017), which infers 3D probabilistic estimates from monocular 2D joint predictions. We also compare to a baseline approach, Tri-CPM-LSTM, a 3D triangulated version of the 2D pose estimation of Cao et al. (2016) with error rejection. In this approach, per-camera 2D joint estimates

$$\begin{aligned} \mathbf {J}_{cpm} = \mathop {{{\mathrm{arg max}}}}\limits _{x,y} m_{S}^{j}(x,y) \end{aligned}$$
(16)

are triangulated into a 3D point, using an error rejection method that maximises the number of 2D estimates consistent with the lowest 3D re-projection error. This is a frame-wise, detection-based approach, and therefore temporal consistency is introduced with two learnt LSTM layers as described in Sect. 3.5, giving Tri-CPM-LSTM.
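
The sketch below gives our interpretation of this baseline's triangulation and error rejection: linear (DLT) triangulation of a joint from the per-camera detections, greedily dropping the worst-agreeing view until the remaining re-projection errors are small. The DLT formulation, the greedy rejection strategy and the 10-pixel threshold are illustrative assumptions rather than the exact baseline implementation.

```python
import numpy as np

def triangulate_dlt(P_list, pts2d):
    """Linear (DLT) triangulation of one joint from 2D detections.
    P_list: list of 3x4 camera projection matrices; pts2d: (N, 2) detections."""
    A = []
    for P, (x, y) in zip(P_list, pts2d):
        A.append(x * P[2] - P[0])
        A.append(y * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(A))
    X = vt[-1]
    return X[:3] / X[3]

def reprojection_errors(P_list, pts2d, X):
    """Pixel re-projection error of the 3D point X in each view."""
    errs = []
    for P, (x, y) in zip(P_list, pts2d):
        p = P @ np.append(X, 1.0)
        errs.append(np.hypot(p[0] / p[2] - x, p[1] / p[2] - y))
    return np.asarray(errs)

def robust_triangulation(P_list, pts2d, max_err_px=10.0):
    """Greedy error rejection: drop the worst-agreeing view until the
    remaining re-projection errors fall below a threshold (or only two
    views remain), keeping as many 2D estimates as possible.
    pts2d must be an (N, 2) numpy array."""
    idx = list(range(len(P_list)))
    while True:
        X = triangulate_dlt([P_list[i] for i in idx], pts2d[idx])
        errs = reprojection_errors([P_list[i] for i in idx], pts2d[idx], X)
        if errs.max() <= max_err_px or len(idx) <= 2:
            return X
        idx.pop(int(np.argmax(errs)))
```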

Table 2 A comparison of our approach to other works on the Human 3.6M dataset; multiview indicates whether the approach uses multiple camera views [the works of Martinez et al. (2017) and Trumble et al. (2018) were published after the time of submission]

As can be seen from Table 2, our proposed approach outperforms all methods compared at the time of publication [the newer works of Martinez et al. (2017) and Trumble et al. (2018) indicate the speed of improvement in the field of 3D pose estimation], despite excluding the fusion with the kinematic-based IMU, with the mean error reduced by 15% compared with Tome et al. (2017), the Tri-CPM-LSTM approach and our previous method (Gilbert et al. 2017). Compared to the state-of-the-art results of Lin et al. (2017), many activities have a similar error of around 5 or 6 cm; however, there is a marked performance improvement in our approach for the activities dog walking and sitting down, while Lin achieves better performance for greeting and waiting. A qualitative comparison to the ground truth is shown in Fig. 6; it shows the high degree of accuracy achievable, representing complex human poses, although, as shown in the bottom-right pose, some unusual poses, probably not sufficiently represented in the training data, are still poorly estimated. To validate the benefit of the proposed multi-channel and temporally consistent approach, we evaluate the Human3.6M dataset with separate parts of the approach in Table 3.

Table 3 Empirical study on the performance of the different parts of the approach on the Human 3.6M dataset

It can be seen that the single channels of matte- or CPM-based PVH perform worse than the multi-channel PVH with both channels combined. This is likely due to the semantic information of the CPM labels complementing the occupancy-based soft mattes. The improvement from enforcing temporal consistency through the LSTM is around 25 mm on average.

4.2 Total Capture

In recent years, high-quality labelled datasets have been a catalyst for rapid development in a number of areas, including object recognition (Deng et al. 2009) and 2D human pose estimation (Andriluka et al. 2014; Lin et al. 2014). These have been hand labelled, providing excellent accuracy and detail; however, this is far harder in 3D, where labelling still generally relies on expensive and less common optical motion capture systems such as Vicon (http://www.vicon.com). This constraint greatly reduces the quantity and variability of existing datasets; Table 4 shows the features of current 3D human pose datasets. As can be seen, Human3.6M has a large amount of synchronised multi-view video and is popular, but contains no IMU sensor data. HumanEva is a smaller dataset, also without IMU information, while TNT15 contains IMU data and MVV but is small in size. Given these restrictions, we propose a new dataset, TotalCapture, to address these shortcomings. It contains a large amount of MVV with synchronised IMU data and Vicon labelling for ground truth. It was captured indoors in a volume measuring roughly 8 \(\times \) 4 m with 8 calibrated HD video cameras at 60 Hz. The variation in the dataset is shown in Fig. 7. To provide accurately labelled ground truth, the optical marker-based Vicon system (http://www.vicon.com) was utilised, calculating 17 3D joint positions and angles by triangulating small (\(0.5\,\mathrm {cm}^3\)) dots visible to infrared cameras; note that these dots are not used explicitly by our algorithm, and their size is negligible compared to the performance volume. The IMU data is provided by 13 sensors on key body parts: head, upper/lower back, upper/lower arms and legs, and feet, providing per-unit orientation and acceleration. The location of the IMU sensors is shown in Fig. 8. The dataset consists of four male and one female subjects, each performing four diverse performances repeated three times: ROM, Walking, Acting and Freestyle, with each sequence lasting around 3000-5000 frames. An example of each performance and subject variation is shown in Fig. 7. There is a total of 1,892,176 frames of synchronised video, IMU and Vicon data (although some are withheld as test footage for unseen subjects). The variation and body motions contained in the acting and freestyle sequences in particular are very challenging, with actions such as yoga, giving directions, bending over and crawling performed in both the train and test data. The train and test partitions are made with respect to subjects and sequences: the training set consists of ROM1, 2, 3; Walking1, 3; Freestyle1, 2 and Acting1, 2 on subjects 1, 2 and 3. The test set consists of the performances Freestyle3 (FS3), Acting3 (A3) and Walking2 (W2) on subjects 1, 2, 3, 4 and 5. This split allows for a comparison of seen and unseen subjects, but always on unseen sequences.

Table 4 Characterising existing 3D human pose datasets and TotalCapture
Fig. 7

Examples of performance variation in the proposed TotalCapture dataset

4.3 Total Capture Evaluation

To compare our approach to other methods, we evaluate against three state-of-the-art approaches: the 3D triangulated CPM (Tri-CPM) described in Sect. 4.1; a flattened multi-view matte-based 2D convolutional neural network approach (Trumble et al. 2016), 2D Matte; and our previously published results without the semantic 2D pose labels in the probabilistic visual hull (Gilbert et al. 2017). The results are shown with and without the temporal consistency provided by the learnt LSTM model. As with Human3.6M, we report the 3D Euclidean error over the 17 joints, quantitatively in Table 5 and qualitatively in Fig. 9 and the accompanying video (available at http://youtu.be/CLDqpze53lU). The table shows that our combined semantic and occupancy-based fusion with IMU outperforms all other methods, including our previous work (Gilbert et al. 2017) by 6 mm and the triangulated CPM by 13 mm, which also performed well on Human3.6M. The ability of the LSTM layers to introduce temporal consistency and remove failure cases improves all approaches by around 20 mm.

Fig. 8

The locations of the 13 IMU sensors (orange boxes)

Table 5 Comparison of our approach on TotalCapture to other human pose estimation approaches, expressed as average per joint per frame error (mm)
Fig. 9

Additional results across diverse poses within TotalCapture. The two skeleton results show the joint estimates from two different camera views

Table 6 Mean per joint error (mm) of the approach components on the TotalCapture Dataset

4.4 Ablation Study

Our ablation study cumulatively enables each of our individual contributions on top of a classic baseline of a 3D Matte PVH. 3D pose estimation performance error is presented in Table 6 for separate parts of the approach.

The table shows that the two channels of the PVH, 3D Matte PVH and 3D CPM PVH, have a similar error when used separately; however, by employing a two-channel PVH it is possible to reduce the error by 20 mm. We also show the accuracy of a 3-channel PVH (3D RGB Matte PVH) using the foreground RGB pixel values instead; this performs worse, due to the increased dimensionality of the 3 channels without the complementary knowledge that combining the occupancy and semantic label channels provides. With regard to the IMU, Raw IMU LSTM uses the raw global orientation of the IMU units without a kinematic solve; an LSTM model trained on this raw IMU input performs badly, with nearly double the error of the Solved IMU. Part of the reason for this higher error is likely that sensor drift within the IMU cannot be modelled correctly by the LSTM. However, by constraining the noisy IMU unit responses with inverse kinematics, we are able to negate the IMU sensor drift to some degree. By then fusing the Solved IMU and the two-channel PVH, the error is further reduced, likely due to the complementary nature of the two data sources. We also show the result of simply averaging the two data streams as the fusion method; this produces a high error, as expected, since it is unable to learn anything about how the two data streams interact.

It is possible to examine the per-frame error for subject 2 and sequence Acting3 in Fig. 10. Looking at the frame-wise errors, each of the two data modes, the 3D PVH and the Solved IMU, has the lower error at times; however, through the use of the fusion layer, the overall error is lower than both. At around frame 1250, the Solved IMU increases in error due to a failure; however, the overall error rate of our proposed approach is relatively unchanged. At frame 2500, the IMU is outperforming the 3D PVH, allowing the fused result to maintain a low error. However, at frame 4000 both modes fail, causing higher errors in both data modes and the fused result; qualitative results for these three frames are shown in Fig. 11. For frame 4000 the higher errors can be seen to be caused by the arms not being extended correctly. The differences between the inferred poses can be quite small, indicating the contribution of all components of the approach, although it is important to note that the errors in the Solved IMU pose for frames 2800 and 4000 are not carried into the final fused results. Run-time performance is 25 fps, including PVH generation.

The ability of the approach to generalise between datasets is an interesting topic; therefore we applied a model trained on the TotalCapture dataset to the Human 3.6M dataset. We used the trained TotalCapture model from Table 6, 3D Matte CPM PVH-LSTM, i.e. the input to the fusion layer (as we cannot use a model that takes IMU data on the Human3.6M dataset). Given the different number of cameras and the far poorer resulting PVHs formed from Human3.6M, we fine-tune the TotalCapture-trained model on Human3.6M with unfixed weights for a single epoch of the Human3.6M training data (normally the model is trained for 100 epochs, where an epoch is a complete pass of the training data). The fine-tuned model was then shown all the test sequences from Human3.6M and achieved an average joint error of 75.3 mm. This is similar to the performance of our approach trained exclusively on Human3.6M, 71.9 mm, as shown in Table 2. This indicates that the learnt model is similar, although a small amount of adaptation is required between the datasets, in this case due to the poor PVH generalisation for the Human3.6M dataset. Later, in Sect. 5, we show results on TotalCaptureOutdoor without any such fine-tuning.

Fig. 10

Per frame accuracy of our proposed approach on sequence A3 Subject2

Fig. 11

Visual comparison of poses resolved at different pipeline stages. TotalCapture: Acting3, Subject 2

4.5 In Depth Analysis

In this section, we explore and analyse some of the parameters in the approach. We investigate the effect of the number of cameras used, the amount of training data, the number of previous frames used for the temporal consistency and the effect the size of the voxels in the PVH volume has on the overall performance.

4.5.1 Number of Cameras Used

Within the TotalCapture dataset there are 8 cameras; the greater the number of cameras, the more visually realistic the PVH. For this work, however, it is possible to remove a large number of these with little or no impact on performance. The 3D PVH is constructed from the intersection of the foreground mattes and the intersection of the semantic 2D joint heat maps. With a greater number of cameras a more realistic PVH can be constructed, as can be seen by comparing Fig. 12a, b, c, which show the foreground matte-based PVH with 8, 6 and 4 cameras respectively, while Fig. 12d shows the PVH for the Human3.6M dataset. The reason the Human3.6M PVH is visually worse than Fig. 12a-c is probably the cameras being closer to the ground and the noisier foreground mattes used; however, performance is not greatly affected. The PVH is visually less realistic with fewer cameras; however, as shown in Table 7, which gives the performance of the whole fusion system with 4, 6 and 8 cameras used to construct the 2-channel visual PVH, the performance is relatively unaffected despite halving the number of cameras used.

Fig. 12

Effect of varying camera count on qualitative PVH appearance, for the TotalCapture dataset (a-c) and Human3.6M (d)

Table 7 Relative accuracy change (mm/joint) when varying the number of cameras

4.5.2 Training Data Size

Generally, training neural networks requires a large amount of varied data, and the more data, the higher the performance, especially as we use 3D convnets, which have an additional dimension and therefore additional weights to learn. We investigate how the amount of training data affects performance. The test sequences were kept consistent throughout, as before, and an increasing percentage of the total available training data was used from Subjects 1, 2 and 3, randomly sampled from a maximum of \(\sim 250\)k MVV frames. Table 8 suggests that the performance is relatively unaffected by the lower amounts of training data. This is partly due to the use of the range-of-motion sequences within the training set. The approach can train with a sparse set of data and does not over-fit even if only 20% of the training data is used.

Table 8 Evaluating the impact on accuracy (relative change in per-joint error, mm) as the training data volume increases

4.5.3 Temporal Frame Length

Within the LSTM layers there are memory cells that remember the previous f data instances in time to provide temporal consistency. For this work \(f=5\), which is a compromise between little or no temporal memory and too long a memory, which would fail to generalise to the test data after training. Figure 13 shows how the performance on the regular train and test sets varies with an increasing number of previous frames. It can be seen that the error is initially higher when little or no previous-frame information is incorporated; it reaches a minimum at around 5-6 frames and then increases again, as the approach starts to over-fit to the training data and cannot generalise well to the unseen test sequences.

Fig. 13

3D Pose estimation error for increasing number of previous frames used by LSTM layers

Table 9 Relative accuracy change (mm/joint) when varying the number of voxels in the PVH
Fig. 14

Effect of voxel size on qualitative PVH appearance

Fig. 15

The camera viewpoints of the TotalCaptureOutdoor dataset (Malleson et al. 2017)

4.5.4 Voxel Resolution

Discrete voxels are used to carve up the occupied 3D volume to produce the probabilistic visual hull fed into the 3D convnet, with an initial resolution of 30 \(\times \) 30 \(\times \) 30 voxels. Therefore, for a 2 \(\times \) 2 \(\times \) 2 m volume, each voxel measures approximately 67 mm per side, which matches the error measure; it could therefore be hypothesised that this is the minimum error noise threshold. We can investigate the effect of this coarse quantisation by increasing and reducing the number of voxels. Table 9 shows the relative effect of adjusting the voxel quantity, and Fig. 14 shows it visually.

It can be seen that there is a slight reduction in performance with both larger and smaller voxels, 125 mm (16 \(\times \) 16 \(\times \) 16) and 41 mm (48 \(\times \) 48 \(\times \) 48) respectively. This is to be expected: with larger voxels the detail is reduced, while with smaller voxels the parameter space is greatly increased (roughly 110,000 elements for 48 \(\times \) 48 \(\times \) 48 voxels compared to 27,000 for 30 \(\times \) 30 \(\times \) 30), making it difficult to learn the additional weight parameters effectively without a corresponding increase in training data.

5 TotalCaptureOutdoor

To further demonstrate the generalisation of the approach, we test on a challenging new dataset used by Malleson et al. (2017). This is an MVV and IMU dataset recorded outdoors in uncontrolled conditions with a moving, changing background and varying illumination. Six video cameras were placed in a \(120^\circ \) arc around the subject, with a large 8 \(\times \) 8 m capture volume. Examples of the camera viewpoints are shown in Fig. 15. For the TotalCaptureOutdoor sequences we use the fully trained model (the Dense Layer Fused approach from Table 6) from the TotalCapture dataset in Sect. 4.3 to predict the joints. To indicate the generalisation ability of the approach, a different camera setup is used (6 cameras against the 8 of the indoor TotalCapture dataset), and the 13 Xsens IMUs were placed only in roughly similar locations to the previous captures. Despite the change in environment from a controlled studio to an unconstrained, sunny and cloudy outdoor setting, we are able to achieve excellent qualitative performance on this more challenging dataset. No ground-truth data is available for this dataset; however, Fig. 16 shows a selection of pose estimates from our full approach, together with the input image, for subject 2. It can be seen that the resolved poses accurately reflect the image despite all the training data coming from the indoor TotalCapture dataset. The moving background from the trees is also correctly ignored as noise by the occupancy-based PVH. Finally, Fig. 17 illustrates the resulting joint estimates from views taken \(360^\circ \) around the subject.

Fig. 16

Visual comparison of poses resolved on the TotalCaptureOutdoor dataset for our proposed approach and the Tri-CPM

Fig. 17

\(360^\circ \) views of a frame from TotalCaptureOutdoor

It shows that despite cameras being present on only one side, we are able to accurately estimate full \(360^\circ \) joint locations.

6 Conclusion

We have presented a novel approach for marker-less performance capture that fuses MVV and IMU data to provide highly accurate 3D human pose estimation. The MVV is used to produce semantic joint estimates and foreground occupancy, with a temporal model provided by LSTM layers, to produce state-of-the-art performance on the Human3.6M dataset, with a mean per-joint error of 71.9 mm. Through the fusion of a forward kinematic solve from IMUs, this error can be further reduced by 10 mm beyond the state of the art. Currently, the limitations of the approach are often due to poor foreground mattes that can cause the PVH to fail to accurately describe the subject's volume. Similarly, in challenging poses the 2D pose estimation can fail in a number of camera views, resulting in a poor input PVH. However, we have shown excellent qualitative results on three datasets, including a challenging outdoor dataset, and we release the TotalCapture dataset, the first publicly available dataset simultaneously capturing MVV, IMU and skeletal ground truth.