1 Introduction

Human motion capture is widely used for computer graphics in movies and games, for performance analysis in sports science, and for virtual and augmented reality applications. Furthermore, 3D human pose estimation is required not only for these specific applications but also in daily situations to enable a broader range of services.

Many 3D human pose estimation methods using external cameras have been proposed [1,2,3,4,5,6,7,8,9,10,11,12,13]. The cameras are statically placed around the users. However, such an external camera setup is impractical in daily situations because of limited portability, space and ground constraints, and occluders in front of the subject.

3D human pose estimation from an egocentric camera perspective can enable a portable motion capture system. However, most methods capture only parts of body motion (hands or faces) because of the limited field of view of a normal camera and the proximate setup position. To address this problem, Jiang and Grauman [13] proposed whole-body pose reconstruction from scenes observed by a chest-mounted camera, but their method lacks accuracy.

The most closely related works, Mo2Cap2 [14] and xR-EgoPose [15, 16], estimate the whole-body 3D ego-pose from distorted images captured by a single fisheye camera mounted near the user’s head. The mounted fisheye camera captures the user’s whole body from a top-down view (see Fig. 1). However, these unique camera optics and setup positions lead to a shortage of training data for deep neural networks, which are the mainstream methodology for 3D human pose estimation. To overcome this problem, those authors generated vast synthetic datasets (530K images in Mo2Cap2 and 380K images in xR-EgoPose).

Fig. 1

(upper left) Real image of the Mo2Cap2 [14] setup. (upper right) Synthetic image of the xR-EgoPose [15, 16] setup. (bottom) Our omnidirectional camera setup and a real image

The above-described 3D ego-pose estimation from a mounted fisheye camera achieves good accuracy but still has some problems. First, a single fisheye camera has a limited field of view; the estimation therefore fails when parts of the body lie outside this field of view. In fact, the Mo2Cap2 and xR-EgoPose setups cannot capture the user’s hands when they are placed on the head. Second, a mounted fisheye camera captures notably different images depending on factors such as the optical properties of the lens, the angle of view, and the setup position. Therefore, Mo2Cap2 and xR-EgoPose must be retrained for each set of camera optics and setup. To address these difficulties, we propose 3D ego-pose estimation from a single mounted omnidirectional camera with a lift-up model customized to our camera setup.

The omnidirectional camera captures the entire circumference with back-to-back dual fisheye cameras. The 360° field of view captures the user’s body under a wider variety of motions than a single fisheye camera and provides flexibility in the setup position. However, the captured images suffer not only from distortion but also from disconnection of view. For example, the omnidirectional camera captures a disconnected arm in the image when the shoulder and hand fall in the fields of view of different lenses. In our hardware setup, the omnidirectional camera is placed in front of the user’s chest (see Fig. 1).

A large-scale training dataset is necessary to learn the image features (distortion and disconnection) for each combination of camera optics and setup. However, our unique hardware optics and setup make it difficult to collect a large-scale dataset. To overcome the shortage of data, we generate a large-scale synthetic training dataset and collect a small-scale real evaluation dataset. Our dataset consists of 151K synthetic and 5K real images with 2D/3D pose annotations and is publicly available.

We propose a simple feed-forward network (a lift-up model) that estimates 3D joint positions from 2D joint locations and works together with a 2D joint location estimator. The lift-up model is customized to 3D ego-pose estimation from a mounted omnidirectional camera. First, we convert the input 2D joint locations to 3D unit vectors pointing from the camera position toward the 3D joint positions. The 3D unit vectorization is implemented using the omnidirectional camera calibration toolbox ocamcalib [17] and our camera parameters. Second, we design a new vector- and distance-based loss function (VD loss). The customized model runs in real time yet achieves accuracy comparable to that of the previous methods on our new dataset.

Our main contribution is that the proposed pipeline approach separates the 2D-3D dependency of 3D human pose estimation in terms of both intrinsic parameters and the data collection pipeline in the field of omnidirectional cameras. Acquiring a large number of in-the-wild images with 2D/3D pose annotations for the egocentric omnidirectional perspective is a time-consuming task even when a professional motion capture system is available. In our pipeline approach, the 3D unit vectorization module confines the impact of the camera optics to the module parameters, which are determined by the intrinsic parameters of the omnidirectional camera. Therefore, our lift-up model is trainable with ground truth 3D joint positions and unit vectors that are easily generated from existing publicly available 3D mocap datasets. This advantage alleviates the data collection and training burden due to changes in camera optics and setups, although it applies only after the 2D joint location estimation.

Our contributions are summarized as follows:

  • We propose a simple lift-up model customized to 3D ego-pose estimation from a single mounted omnidirectional camera. The model runs in real time yet achieves accuracy comparable to that of previous works on our dataset.

  • In our pipeline approach, the model is trainable with ground truth 3D joint positions and the unit vectors that are easily generated from existing publicly available 3D mocap datasets. This advantage alleviates the data collection and training burden due to changes in camera optics and setups.

  • We build a new large-scale synthetic and real dataset in our omnidirectional camera setup. The dataset consisting of 151K synthetic and 5K real images with 2D/3D pose annotations is publicly available.

2 Related work

We discuss monocular 3D human pose estimation methods focusing on the following camera setups: an external camera that captures the subjects from a distance; a mounted camera that captures the subjects from the egocentric perspective; and a mounted fisheye or omnidirectional camera that captures the subjects from a wider field of view.

2.1 3D pose estimation from an external camera

Convolutional neural networks and large-scale 2D and 3D datasets have recently enabled advances in 3D pose estimation from the images captured by a single camera or multiple cameras [1,2,3]. In monocular 3D pose estimation, two main approaches have emerged: (1) direct regression approaches to 3D joint positions from images [4,5,6,7,8,9,10] and (2) pipeline approaches that decouple the problem into the tasks of 2D joint location estimation and subsequent 3D lift-up [11, 12].

The accuracy and generalization of direct regression approaches are severely limited by the scarcity of 3D pose annotations for in-the-wild images. The two-step decoupled approaches have two advantages: (1) the availability of high-quality off-the-shelf 2D joint location estimators that require only easy-to-harvest 2D annotations [18,19,20,21] and (2) the possibility of training the 3D lift-up step using 3D mocap datasets and their ground truth 2D projections without images.

Martinez et al. [11] showed that even simple architectures solve the lift-up task with a low error rate. We propose a simple pipeline approach customized to 3D ego-pose estimation from a mounted omnidirectional camera.

2.2 3D pose estimation from an egocentric perspective

Self 3D pose estimation from a mobile mounted camera has been in demand for daily activity recognition in recent years [22, 23]. However, most methods detect only parts of body motion (hands or faces) because the limited field of view and proximate setup position make it quite challenging to capture the whole-body pose.

Jiang and Grauman [13] proposed whole-body pose reconstruction from scenes observed by a chest-mounted camera, although the approach lacks accuracy and certainty.

Ahuja et al. [24] proposed a low-cost VR/AR headset composed of a pair of hemispherical mirrors and a smartphone camera. The system obtains 2D human poses from the images reflected in the hemispherical mirrors using OpenPose [25]. Subsequently, the authors lift the 2D pose results to 3D human poses using the viewpoint differences between the two mirrors.

2.3 3D ego-pose estimation from mounted fisheye cameras

The first approach toward direct whole-body pose estimation from an egocentric fisheye camera was proposed by Rhodin et al. [26]. A stereo fisheye camera pair was placed at a distance of approximately 25 cm from the user’s head using telescopic sticks mounted on a helmet. Although the wide field of view captures most of the body, the setup is fairly cumbersome for users.

Lightweight monocular fisheye camera approaches were proposed by Xu et al. [14] and Tome et al. [15, 16]. The camera is placed in front of the user’s forehead using a baseball cap or a head-mounted display. Both works generated large-scale synthetic training datasets to address the shortage of data caused by the unique top-down view and the distortion introduced by the fisheye optics.

Xu et al. [14] proposed a direct regression model (Mo2Cap2) that estimates 3D unit vectors and distances to the 3D joint positions from the camera position. The 3D unit vectors are obtained from the estimated 2D joint locations using the omnidirectional camera calibration toolbox ocamcalib [17].

Tome et al. [15, 16] proposed a pipeline approach using a multibranch encoder-decoder model (xR-EgoPose) to estimate 3D joint positions from 2D joint location heatmaps. The lift-up model can be trained separately using 3D mocap datasets and their ground truth 2D heatmaps without raw image pixels. However, the model requires the 3D pose annotations to be converted into 2D heatmaps for each change in camera optics and setup.

Miura and Sako [27] proposed a single-omnidirectional-camera approach with the camera placed in front of the user’s neck. They validated that the location map method [9] can estimate the 3D ego-pose from images that include not only distortion but also disconnection of view. However, their method handles only the upper-body joints, and the input image must be converted to an equirectangular projection.

Zhang et al. [28] proposed an automatic calibration method that improves the accuracy of 3D joint positions by predicting the intrinsic parameters of omnidirectional cameras. This automatic calibration still depends on the diversity of optical setups in the dataset because the intrinsic parameter estimation is learned from the 2D/3D joint position dependency. Therefore, previous works necessarily trained their models for each set of camera optics and setup.

3 Approach

We propose 3D ego-pose estimation from a single mounted omnidirectional camera composed of back-to-back dual fisheye cameras. We set the omnidirectional camera at a distance of approximately half the shoulder width from the user’s chest using a telescopic stick mounted on the body (see Fig. 1). The camera is lightweight (27 g), small (37.6 mm in diameter), and has a wide field of view (210° for each fisheye lens). Our hardware setup is portable and captures the user’s body under a wide variety of motions.

We generate a large-scale training dataset of synthetic images with ground truth 2D/3D pose annotations in our unique setup. We also collect a small-scale real evaluation dataset.

Because of the camera’s optical properties and proximate setup position, the captured images include both distortion and disconnection of view. Furthermore, the captured images vary drastically with factors such as the optical properties of the lens, the angle of view, and the setup position. This variety of image features makes 3D ego-pose estimation with a convolutional neural network challenging.

We address these challenges with a pipeline approach using a lift-up model customized to our setup. Our model is trainable with ground truth 3D joint positions and unit vectors that are easily generated from existing publicly available 3D mocap datasets. This advantage alleviates the data collection and training burden due to changes in camera optics and setups, although it applies only after the 2D joint location estimation.

3.1 Synthetic training dataset

We present a large-scale training dataset for our unique hardware setup. Acquiring a large amount of annotated 3D pose data is a mammoth task for external camera setups and is even more difficult for the egocentric perspective. Furthermore, acquiring a large number of in-the-wild images with 2D/3D pose annotations for the egocentric omnidirectional perspective is a time-consuming task even when a professional motion capture system is available.

We alleviate these difficulties by rendering a synthetic human body model from a virtual mounted omnidirectional camera perspective. To acquire a large variety of training data, we build a dataset based on the large-scale synthetic human dataset SURREAL [29]. We animate the human model using the SMPL body model [30] with motions sampled from the CMU MoCap dataset. Body textures are randomly chosen from the texture dataset provided by SURREAL.

To generate realistic images, we simulate the camera and background as in a real-world scenario. The virtual camera is placed at a position similar to that of our hardware setup. The camera position is randomly perturbed in each rendering because, in the real world, it moves slightly with body movements and poses. Specifically, the camera is placed at a distance of shoulder width × 0.45 in front of the center of the shoulder line. The camera position is then perturbed in 3D space according to a normal distribution N(σ² = shoulder width × 0.1). We apply the intrinsic camera parameters obtained from the real omnidirectional camera using the omnidirectional camera calibration toolbox ocamcalib [17]. The rendered images are composited with backgrounds randomly chosen from 50 indoor and 54 outdoor images captured by a real omnidirectional camera. A minimal sketch of this camera placement sampling is shown below.
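The following NumPy sketch illustrates the camera placement and perturbation described above, under our own assumptions about the chest-forward direction and about interpreting σ² as the per-axis variance; the function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def sample_camera_position(l_shoulder, r_shoulder, forward, rng=None):
    """Sample a virtual omnidirectional camera position in front of the chest (illustrative).

    l_shoulder, r_shoulder: 3D shoulder joint positions; forward: unit chest-forward direction.
    """
    rng = rng or np.random.default_rng()
    shoulder_width = np.linalg.norm(r_shoulder - l_shoulder)
    shoulder_center = 0.5 * (l_shoulder + r_shoulder)

    # Nominal placement: shoulder width x 0.45 in front of the shoulder-line center.
    nominal = shoulder_center + 0.45 * shoulder_width * forward

    # Per-axis Gaussian perturbation with variance sigma^2 = shoulder width x 0.1,
    # mimicking the slight camera movement caused by body motion in the real world.
    sigma = np.sqrt(shoulder_width * 0.1)
    return nominal + rng.normal(0.0, sigma, size=3)
```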

Our synthetic training dataset contains ground truth annotations of 2D/3D joint positions, which are easily generated using the 3D mocap data and the camera calibration toolbox. We use the following 18 body joints: head, neck, spine, pelvis, shoulders, elbows, wrists, hands, hips, knees, and ankles. The 3D joint positions are expressed in the omnidirectional camera coordinate system of our hardware setup. In total, we collect 151,280 synthetic images with ground truth 2D/3D pose annotations for our large-scale training dataset. Examples are shown in Fig. 2.

Fig. 2

Indoor and outdoor synthetic images with ground truth 2D/3D pose annotations

3.2 Real evaluation dataset

We collect a dataset in real situations for quantitative evaluation. We record 2 people in everyday clothing, indoors and outdoors, performing 5 actions (boxing, dancing, hands up, sitting, and walking).

The ground truth 3D joint positions are recorded using a commercial external RGB-D camera and 3D skeleton tracking software. We simultaneously record the mounted omnidirectional camera images and the 3D joint positions obtained by the RGB-D camera and the tracking software. The 3D joint positions are postprocessed into the omnidirectional camera coordinate system using the fixed omnidirectional camera position. The ground truth 2D joint locations are converted from the 3D joint positions using the camera calibration toolbox.

We use different 3D mocap systems for the synthetic training dataset and the real evaluation dataset; therefore, the 3D skeleton structure and scale differ slightly. To reduce the gap, we normalize the 3D joint positions of both datasets in the following 3 steps while maintaining the skeleton shape: (1) scale the joint positions so that the shoulder width is 1.0; (2) rotate the joint positions so that the shoulder line is horizontal; and (3) rotate the joint positions so that the line between the neck and the pelvis is vertical. A minimal sketch of this normalization is shown below.
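The three steps could be implemented roughly as follows; this is a minimal NumPy sketch under our own axis conventions (x along the shoulder line, y vertical) and joint indexing, not the authors' code.

```python
import numpy as np

def rotation_between(a, b):
    """Rotation matrix mapping unit vector a onto unit vector b (Rodrigues' formula;
    the antiparallel corner case is omitted for brevity)."""
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    v, c = np.cross(a, b), np.dot(a, b)
    vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)

def normalize_skeleton(joints, l_sho, r_sho, neck, pelvis):
    """joints: (J, 3) array of 3D joint positions; the other arguments are joint indices."""
    # (1) Scale so that the shoulder width becomes 1.0.
    joints = joints / np.linalg.norm(joints[r_sho] - joints[l_sho])
    # (2) Rotate so that the shoulder line lies along the horizontal x axis.
    R1 = rotation_between(joints[r_sho] - joints[l_sho], np.array([1.0, 0.0, 0.0]))
    joints = joints @ R1.T
    # (3) Rotate about the x axis (preserving step 2) so the neck-pelvis line becomes vertical.
    d = joints[neck] - joints[pelvis]
    phi = -np.arctan2(d[2], d[1])           # angle of the y-z projection away from the y axis
    c, s = np.cos(phi), np.sin(phi)
    R2 = np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    return joints @ R2.T
```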

3.3 Vector- and distance-based lift-up model

Our lift-up model is a simple feed-forward network that estimates 3D joint positions from 2D joint locations and works with a 2D joint location estimator in the pipeline approach. The model is mainly based on Martinez et al. [11] and uses batch normalization, dropout, ReLU, residual connections [31], and training with a max-norm constraint for high accuracy and generalization. Additionally, our lift-up model is customized to 3D ego-pose estimation from a mounted omnidirectional camera setup as described below.

  • 3D Unit Vectorization: We convert the estimated 2D joint locations on the image plane to 3D unit vectors pointing toward each 3D joint position in the omnidirectional camera coordinate system. The 3D unit vectorization is implemented using the omnidirectional camera calibration toolbox ocamcalib [17] and our camera parameters. This lifts the input representation from the 2D image plane into 3D space, enriching the information available to the model (see the sketch after this list).

  • Vector and Distance Loss Function: Martinez et al. [11] trained their model with the L2 loss function. In the omnidirectional camera coordinate system, a 3D joint position can be decomposed into a 3D unit vector and a distance (the magnitude of the vector). Inspired by this decomposition and by the 3D unit vector input of the lift-up model, we design a new loss function based on the vector and the distance (VD loss):

    $$ \text{VDLoss}({\boldsymbol P}_{j}) = \lambda_{\theta}\,\theta\!\left({\boldsymbol P}_{j}^{GT}, {\boldsymbol P}_{j}\right) + \lambda_{d}\,\mathrm{D}\!\left(\|{\boldsymbol P}_{j}^{GT}\|, \|{\boldsymbol P}_{j}\|\right) $$

    for a 3D joint position Pj, where GT denotes the ground truth. The loss function is composed of a cosine-similarity error and a distance error:

    $$ \theta({\boldsymbol x}, {\boldsymbol y}) = 1 - \frac{{\boldsymbol x} \cdot {\boldsymbol y}}{\|{\boldsymbol x}\|\,\|{\boldsymbol y}\|}, \qquad \mathrm{D}(x, y) = \lvert x - y \rvert. $$

    We set the coefficients λ𝜃 = 1.0 and λd = 0.1 based on a grid search. The VD loss regresses the input 3D unit vectors to 3D joint positions under the constraint of the cosine-similarity error of the vectors.
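A minimal sketch of the two customizations above: the 3D unit vectorization assumes the Scaramuzza-style polynomial model used by ocamcalib (the parameter names here are our own), and the VD loss is written in PyTorch with the cosine-similarity and distance terms defined in the equations. It is an illustrative sketch, not the authors' implementation.

```python
import numpy as np
import torch

def pixel_to_unit_vector(uv, pol, center, affine):
    """Map a pixel (u, v) to a 3D unit ray, assuming a Scaramuzza-style model as in ocamcalib.

    pol:    polynomial coefficients (a0, a1, ..., aN) of the cam2world mapping
    center: distortion center (cx, cy) in pixels
    affine: (c, d, e) affine correction parameters
    """
    c, d, e = affine
    # Undo the affine transform and recenter the pixel coordinates.
    m = np.linalg.inv(np.array([[c, d], [e, 1.0]])) @ (np.asarray(uv) - np.asarray(center))
    rho = np.linalg.norm(m)
    z = np.polyval(pol[::-1], rho)          # cam2world polynomial evaluated at rho
    vec = np.array([m[0], m[1], z])
    return vec / np.linalg.norm(vec)

def vd_loss(pred, gt, lam_theta=1.0, lam_d=0.1):
    """VD loss over predicted and ground truth 3D joint positions of shape (batch, joints, 3)."""
    cos = torch.nn.functional.cosine_similarity(pred, gt, dim=-1)
    theta_err = (1.0 - cos).mean()                                  # cosine-similarity error
    dist_err = (pred.norm(dim=-1) - gt.norm(dim=-1)).abs().mean()   # distance error
    return lam_theta * theta_err + lam_d * dist_err
```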

We describe the pipeline process to estimate the 3D ego-pose from the mounted omnidirectional camera images in Fig. 3. Our lift-up model can flexibly adjust its complexity and size via the number of weights w and the number of residual blocks b (a code sketch of the architecture is given after the Fig. 3 caption). The most important advantage of our model is that it can be trained separately with ground truth 3D joint positions and unit vectors.

Fig. 3

Pipeline process of 3D ego-pose estimation from mounted omnidirectional camera images. Our lift-up model takes as input 3D unit vectors converted from estimated 2D joint locations. The parameter w indicates the number of input/output weights of the fully connected layers except for the first and final layers. The parameter b indicates the number of residual blocks. If b = 0, our lift-up model consists of only an input basic block and the final fully connected layer, without residual blocks
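One possible realization of this architecture in PyTorch, following the Martinez-style basic block (linear layer, batch normalization, ReLU, dropout) parameterized by w and b; details such as the dropout rate are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Linear -> BatchNorm -> ReLU -> Dropout, in the style of Martinez et al. [11]."""
    def __init__(self, in_dim, out_dim, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.BatchNorm1d(out_dim),
            nn.ReLU(inplace=True), nn.Dropout(p_drop))

    def forward(self, x):
        return self.net(x)

class LiftUpModel(nn.Module):
    """Maps 3D unit vectors (J joints x 3) to 3D joint positions (J joints x 3)."""
    def __init__(self, num_joints=18, w=8, b=0):
        super().__init__()
        in_dim = out_dim = num_joints * 3
        self.inp = BasicBlock(in_dim, w)                    # input basic block
        self.res_blocks = nn.ModuleList(
            [nn.Sequential(BasicBlock(w, w), BasicBlock(w, w)) for _ in range(b)])
        self.out = nn.Linear(w, out_dim)                    # final fully connected layer

    def forward(self, unit_vectors):                        # (batch, J, 3)
        x = self.inp(unit_vectors.flatten(1))
        for block in self.res_blocks:
            x = x + block(x)                                # residual connection
        return self.out(x).view(-1, unit_vectors.shape[1], 3)
```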

4 Evaluation

We quantitatively evaluate our lift-up model on the synthetic training dataset and the real evaluation dataset. We sample 18,910 images from the synthetic training dataset because of limited computational resources. We use the mean joint position error (MJPE) and the percentage of correct keypoints (PCK) as evaluation metrics. The error is the Euclidean distance between the estimated and ground truth 3D joint positions. In the evaluation, we rescale the normalized 3D joint positions to real measurements using the shoulder width (in mm) of the evaluation dataset. Both metrics are sketched below.
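A minimal NumPy sketch of the two metrics as we understand them (MJPE as the mean Euclidean joint error in mm, PCK@30 mm as the percentage of joints within 30 mm of the ground truth):

```python
import numpy as np

def mjpe(pred, gt):
    """Mean joint position error in mm; pred and gt have shape (frames, joints, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pck(pred, gt, threshold_mm=30.0):
    """Percentage of joints whose Euclidean error is below the threshold."""
    errors = np.linalg.norm(pred - gt, axis=-1)
    return 100.0 * (errors < threshold_mm).mean()
```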

4.1 Implementation and training details

Our pipeline approach requires a 2D joint location estimator in the first step. The competing direct regression approaches can be internally decoupled into a 2D module and a 3D module. The 2D module estimates 2D joint locations by heatmap regression, and the 3D module extends the 2D module results to 3D joint positions using each model’s own method. In the evaluation, we feed the competing 2D module results into our 3D unit vectorization for a fair comparison.

We first train the competing direct regression models Mo2Cap2 [14] and VNect [9]. The competing models, based on ResNet50 [31], output 32 × 64 pixel heatmaps from their 2D modules for input images with a resolution of 256 × 512 pixels. We pretrain the 2D module on a 2D pose estimation task using the MPII Human Pose dataset [32] so that the models learn good low-level features from real images with normal camera optics. Subsequently, we fine-tune the pretrained models on our synthetic training dataset so that the 2D/3D modules learn the omnidirectional camera optics. Fine-tuning is carried out with a batch size of 32 for 70 epochs using the Adam optimizer with an initial learning rate of 0.05. We decrease the learning rate to 0.001 for the initial 13 residual blocks of the models to preserve the low-level features learned from real images. A sketch of this layer-wise learning rate setup is shown below.
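The layer-wise learning rates could be set up with PyTorch parameter groups roughly as follows, assuming the initial 13 residual blocks correspond to layer1–layer3 of ResNet50 (3 + 4 + 6 = 13 bottleneck blocks); this split is our assumption, and the learning rates are the values reported above.

```python
import torch
import torchvision

# ResNet50 backbone as used by the competing 2D modules (task-specific heads omitted here).
backbone = torchvision.models.resnet50(weights=None)
early = torch.nn.ModuleList([backbone.conv1, backbone.bn1,
                             backbone.layer1, backbone.layer2, backbone.layer3])
late = torch.nn.ModuleList([backbone.layer4])

optimizer = torch.optim.Adam([
    {"params": early.parameters(), "lr": 0.001},  # lower rate preserves low-level features
    {"params": late.parameters(), "lr": 0.05},    # initial learning rate reported in the text
])
```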

Our lift-up model is trained with 2D joint locations estimated by the competing 2D modules on the synthetic training dataset. Training is carried out with a batch size of 32 for 70 epochs using the Adam optimizer with an initial learning rate of 0.001. We set the model architecture to 8 weights (w = 8) and 0 residual blocks (b = 0) in this evaluation based on the ablation study results. Our lift-up model thus has only 2 fully connected layers and no residual blocks.

4.2 Comparison with previous works

We quantitatively compare our lift-up model to related previous works on the real evaluation dataset. We compare against Mo2Cap2 [14] and xR-EgoPose [15, 16], which estimate the 3D ego-pose from a single mounted fisheye camera. As the xR-EgoPose model, we use a dual-branch encoder-decoder with 3D pose and heatmap branches in this comparison. Additionally, we compare against VNect [9], which was validated for a mounted omnidirectional camera setup by Miura and Sako [27]. The decoupled two-step approaches (xR-EgoPose and ours) are trained and evaluated on the 2D module results of both Mo2Cap2 and VNect.

We present the MJPE (mm) and PCK@30 mm results in Table 1. Our lift-up model performs worse than xR-EgoPose in MJPE (mm) even though it obtains the best accuracy in PCK@30 mm. These results indicate that our model achieves high accuracy when the 2D joint location estimator works well; however, deterioration of the 2D joint location estimation has a larger negative impact on our lift-up model because it is less robust than xR-EgoPose. Additionally, our lift-up model obtains higher accuracy with the 2D modules of VNect and Mo2Cap2 than the original full models.

Table 1 MJPE (mm) results on a real evaluation dataset for comparison with previous works. PCK@30 mm is also indicated in brackets

We report the model parameter sizes and the estimated processing time on a CPU (Intel Xeon @ 2.20 GHz) in Table 2. Because VNect and Mo2Cap2 contain the whole process of estimating 3D joint positions from the input image, we also report their 2D-module-only results, which exclude the 3D module. Our lift-up model shows a smaller parameter size and faster execution time even when combined with the 2D modules.

Table 2 Model parameter size and estimated time on CPU (Intel Xeon @ 2.20 GHz) for previous works and our model. We indicate 2D module results as well in brackets for VNect and Mo2Cap2

We present examples of 3D ego-pose estimation results obtained by our lift-up model with VNect’s 2D module in Fig. 4. The skeleton structures of the estimation and the ground truth differ slightly because the 3D joint positions of the synthetic training dataset and the real evaluation dataset were acquired with different mocap systems.

Fig. 4

Examples of 3D ego-pose estimation results obtained by our lift-up model with VNect’s 2D module. (left column) Input mounted omnidirectional camera images. (center and right column) Ground truth and estimated 3D joint positions from different angles

5 Ablation study

To further analyze our approach, we evaluate additional aspects of our lift-up model: the accuracy and generalization with respect to the model parameters (w and b), the effectiveness of the 3D unit vectorization and the VD loss function, and the possibility of training with ground truth 3D pose annotations. Additionally, we evaluate our model on the Mo2Cap2 dataset [14].

5.1 Model parameters

We report the MJPE (mm) results and standard deviations of our lift-up model in Fig. 5. The model parameters are varied as follows: the number of weights w from 4 to 16 and the number of residual blocks b from 0 to 3. We note that VNect’s 2D module provides better 2D joint locations than Mo2Cap2’s 2D module.

Fig. 5

MJPE (mm) results and standard deviations of our lift-up model based on (left) VNect’s 2D module and (right) Mo2Cap2’s 2D module. The model parameters are varied as follows: the number of weights w from 4 to 16 and the number of residual blocks b from 0 to 3. Markers indicate the number of weights w to aid visibility of the standard deviations

A model with more weights estimates 3D joint positions with higher accuracy, although the increased number of weights degrades generalization to the poorer 2D joint locations estimated by Mo2Cap2’s 2D module because of overfitting to the training dataset. Regarding the residual blocks, increasing their number slightly improves accuracy and generalization for an appropriately sized model.

For our training and evaluation datasets, we set the model parameters to 8 weights (w = 8) and 0 residual blocks (b = 0), considering accuracy, generalization, and model size.

5.2 3D unit vectorization and VD loss function

We present the MJPE (mm) and PCK@30 mm results of the Martinez et al. [11] model and our models in Table 3 to analyze the effectiveness of the 3D unit vectorization and the VD loss function.

Table 3 MJPE (mm) results of Martinez et al. and our lift-up models. PCK@30 mm is also indicated in brackets

The Martinez et al. model takes only the estimated 2D joint locations as input and is trained with the L2 loss function. In our pipeline approach, we convert the 2D joint locations to 3D unit vectors as the input of our lift-up model. The 3D unit vectorization provides better accuracy for both 2D module baselines. A comparison between the L2 loss and the VD loss on our model shows that the VD loss also provides better estimation results.

5.3 Ground truth training

We present the MJPE (mm) and PCK@30 mm results in Table 4. Here, our lift-up model is trained with the ground truth 3D joint positions and unit vectors of the synthetic training dataset. Additionally, we train our model on a held-out synthetic dataset that is produced by the same synthetic generation pipeline but is not used to train the 2D joint location estimator. The ground truth 3D joint positions and unit vectors are easily generated from the 3D pose annotations.

Table 4 MJPE (mm) results of our lift-up models trained with estimated 2D joint locations, ground truth 3D pose annotations, and ground truth 3D pose annotations of exclusive synthetic dataset. PCK@30 mm is also indicated in brackets

Training with the 2D estimation results performs better in general because the model learns robustness to the 2D joint location estimator’s errors. However, our lift-up model shows comparable accuracy even when trained on ground truth data that were not used in the 2D module training.
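Generating such ground truth training pairs from 3D pose annotations only requires splitting each camera-frame joint position into its direction and magnitude; a minimal sketch, assuming the positions are already expressed in the omnidirectional camera coordinate system:

```python
import numpy as np

def make_training_pair(joints_cam):
    """joints_cam: (J, 3) ground truth joint positions in the camera coordinate system.

    Returns the 3D unit vectors (model input) and the positions themselves (target),
    whose magnitudes are the camera-to-joint distances.
    """
    distances = np.linalg.norm(joints_cam, axis=-1, keepdims=True)
    unit_vectors = joints_cam / distances
    return unit_vectors, joints_cam
```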

5.4 Evaluation on the Mo2Cap2 Dataset

We compare our lift-up model with previous works [8, 9, 14,15,16] on the Mo2Cap2 dataset. We cannot obtain estimated 3D unit vectors for the training dataset because the Mo2Cap2 dataset does not provide the fisheye camera intrinsic parameters. Therefore, we use the ground truth 3D joint positions and unit vectors of the training dataset.

We train our model on the ground truth training data and set the model parameters to 2 residual blocks (b = 2) and 4096 weights (w = 4096). The model parameter size is 67.5 M, and the execution time is 16.592 ms/frame on a CPU. Training is carried out with λ𝜃 = 1.0, λd = 0.00001, a batch size of 1024, and 5K epochs.

We show the MJPE (mm) results in Table 5. Our model is worse than Mo2Cap2 in the overall average, although it performs better than Mo2Cap2 in some actions. We present examples of estimation results in Fig. 6.

Table 5 MJPE (mm) results on the Mo2Cap2 dataset for comparison with previous works
Fig. 6

Examples of estimation results on the Mo2Cap2 dataset: (top) input image, (middle) ground truth, and (bottom) estimation result

6 Discussion

6.1 Dataset restriction

Our synthetic and real datasets have two restrictions: (1) the fixed omnidirectional camera position and (2) the skeleton normalization during recording. These restrictions make 3D ego-pose estimation an easier problem. However, for wearable devices in daily use, the setup position is usually determined by the purpose, for example, a head-mounted display, smart glasses, or a microphone headset. Regarding skeleton normalization, the human pose is more important than the real-scale 3D joint positions, which can be recovered by postprocessing with user data. For these reasons, our dataset restrictions are acceptable in practical application scenarios.

6.2 Model performance and size

Our lift-up model can be flexibly adjusted to the performance of the 2D joint location estimator via its model parameters, the numbers of weights and residual blocks. Specifically, increasing the number of weights and residual blocks in accordance with the quality of the 2D estimator improves the accuracy of 3D joint position estimation.

Another advantage is that our model size does not grow with the input image resolution. Enlarging the input images is the most direct approach to improving the accuracy of 2D/3D joint position estimation, but it increases the number of parameters of convolutional neural networks such as VNect [9], Mo2Cap2 [14], and xR-EgoPose [15, 16]. Our model avoids this parameter growth when the input images are enlarged, although this advantage applies only after the 2D joint location estimation.

6.3 Separately training the 2D/3D module

Our lift-up model is trainable with ground truth 3D joint positions and unit vectors. This advantage enables the separate training of the 2D joint location estimator and our model.

xR-EgoPose is also separately trainable with ground truth 3D mocap datasets and their 2D projections. However, the 2D joint location heatmaps must be regenerated according to the omnidirectional camera properties and setup. This 2D-3D dependency requires data collection and model training for each camera property and setup.

Our pipeline approach separates the 2D-3D dependency of 3D human pose estimation in terms of both intrinsic parameters and the data collection pipeline in the field of omnidirectional cameras. Specifically, the 3D unit vectorization module confines the impact of the camera optics to the module parameters, which are determined by the intrinsic parameters of the omnidirectional camera. Therefore, our model can be trained separately with ground truth 3D joint positions and unit vectors that are easily generated from existing publicly available 3D mocap datasets.

7 Conclusion

We proposed 3D ego-pose estimation from a single mounted omnidirectional camera that captures the entire circumference with back-to-back dual fisheye cameras. The 360° field of view captures the user’s body under a wide variety of motions and provides flexibility in the camera setup position.

We built a new large-scale synthetic training dataset and a real evaluation dataset using our unique omnidirectional camera setup. The dataset contains 151K synthetic and 5K real images with ground truth 2D/3D pose annotations and is publicly available.

We proposed a pipeline approach using a lift-up model to estimate 3D joint positions from 2D joint locations. The model works with a 2D joint location estimator and is customized to the mounted omnidirectional camera setup through 3D unit vectorization of the 2D joint locations and training with the vector- and distance-based loss function (VD loss). The customized model runs in real time yet shows accuracy comparable to that of previous methods on our dataset.

The most important advantage of our pipeline approach is that it separates the 2D-3D dependency of 3D human pose estimation in terms of both intrinsic parameters and the data collection pipeline in the field of omnidirectional cameras. Our lift-up model does not need to learn input image features that vary drastically with factors such as the optical properties of the lens, the angle of view, and the setup position, because the 3D unit vectorization module confines the impact of the camera optics to the module parameters determined by the intrinsic parameters of the omnidirectional camera. Therefore, our model can be trained separately with ground truth 3D joint positions and unit vectors that are easily generated from existing publicly available 3D mocap datasets. This advantage alleviates the data collection and training burden arising from changes in camera optics and setups, although it applies only after the 2D joint location estimation.