1 Introduction

Human motion capture is widely used for computer graphics in movies and games, for performance analysis in sports science, and for virtual and augmented reality applications. Furthermore, 3D human pose estimation is required not only for these specific applications but also in daily situations to enable a broader range of services.

Many 3D human pose estimation methods using external cameras have been proposed [1,2,3,4,5,6,7,8,9,10,11,12,13]. The cameras are statically placed around the users. However, such an external camera setup is impractical in daily situations because of limited portability, space and ground constraints, and occluders in front of the subject.

3D human pose estimation from an egocentric camera perspective can enable a portable motion capture system. However, most methods capture only parts of body motion (hands or faces) because of the limited field of view of a normal camera and the proximate setup position. To address this problem, Jiang and Grauman [13] proposed whole-body pose reconstruction from scenes observed by a chest-mounted camera, but their method lacks accuracy.

The most closely related works, Mo2Cap2 [14] and xR-EgoPose [15, 16], estimate the whole-body 3D ego-pose from distorted images captured by a single fisheye camera mounted near the user’s head. The mounted fisheye camera captures the user’s whole body from a top-down view (see Fig. 1). However, these unique camera optics and setup positions lead to a shortage of training data for deep neural networks, which are the mainstream methodology for 3D human pose estimation. To overcome this problem, those authors generated vast synthetic datasets (530K images in Mo2Cap2 and 380K images in xR-EgoPose).

Fig. 1

(upper left) Real image of the Mo2Cap2 [14] setup. (upper right) Synthetic image of the xR-EgoPose [15, 16] setup. (bottom) Our omnidirectional camera setup and a real image

The above-described 3D ego-pose estimation from a mounted fisheye camera achieves good accuracy but still has some problems. First, a single fisheye camera has a limited field of view; the estimation therefore fails when parts of the body lie outside this field of view. In fact, the Mo2Cap2 and xR-EgoPose setups cannot capture the user’s hands when they are placed on the head. Second, a mounted fisheye camera captures notably different images depending on factors such as the optical properties of the lens, the angle of view, and the setup position. Therefore, Mo2Cap2 and xR-EgoPose must be retrained for each set of camera optics and setup. To address these difficulties, we propose 3D ego-pose estimation from a single mounted omnidirectional camera with a lift-up model customized to our camera setup.

The omnidirectional camera captures the entire circumference with back-to-back dual fisheye cameras. The 360° field of view captures the user’s body under a wider variety of motions than a single fisheye camera and provides flexibility in the setup position. However, the captured images suffer not only from distortion but also from disconnection of view. For example, the omnidirectional camera captures a disconnected arm in the image when the shoulder and hand fall in the fields of view of different lenses. In our hardware setup, the omnidirectional camera is placed in front of the user’s chest (see Fig. 1).

A large-scale training dataset is necessary to learn the image features (distortion and disconnection) for each combination of camera optics and setup. However, our unique hardware optics and setup make it difficult to collect a large-scale dataset. To overcome the shortage of data, we generate a large-scale synthetic training dataset and collect a small-scale real evaluation dataset. Our dataset consists of 151K synthetic and 5K real images with 2D/3D pose annotations and is publicly available.

We propose a simple feed-forward network (a lift-up model) that estimates 3D joint positions from 2D joint locations and works together with a 2D joint location estimator. The lift-up model is customized to 3D ego-pose estimation from a mounted omnidirectional camera. First, we convert the input 2D joint locations to 3D unit vectors pointing from the camera position toward the 3D joint positions. The 3D unit vectorization is implemented using the omnidirectional camera calibration toolbox ocamcalib [17] and our camera parameters. Second, we design a new vector- and distance-based loss function (VD loss). The customized model runs in real time yet achieves accuracy comparable to that of the previous methods on our new dataset.

Our main contribution is that the proposed pipeline approach separates the 2D-3D dependency of 3D human pose estimation in terms of both intrinsic parameters and the data collection pipeline in the field of omnidirectional cameras. Acquiring a large number of in-the-wild images with 2D/3D pose annotations for the egocentric omnidirectional perspective is a time-consuming task even when a professional motion capture system is available. In our pipeline approach, the 3D unit vectorization module confines the impact of the camera optics to the module parameters, which are determined by the intrinsic parameters of the omnidirectional camera. Therefore, our lift-up model is trainable with ground truth 3D joint positions and unit vectors that are easily generated from existing publicly available 3D mocap datasets. This advantage alleviates the data collection and training burden due to changes in camera optics and setups, although it applies only after the 2D joint location estimation.

Our contributions are summarized as follows:

  • We propose a simple lift-up model customized to 3D ego-pose estimation from a single mounted omnidirectional camera. The model runs in real time yet achieves accuracy comparable to that of previous works on our dataset.

  • In our pipeline approach, the model is trainable with ground truth 3D joint positions and the unit vectors that are easily generated from existing publicly available 3D mocap datasets. This advantage alleviates the data collection and training burden due to changes in camera optics and setups.

  • We build a new large-scale synthetic and real dataset in our omnidirectional camera setup. The dataset consisting of 151K synthetic and 5K real images with 2D/3D pose annotations is publicly available.

2 Related work

We discuss monocular 3D human pose estimation methods focusing on the following camera setups: an external camera that captures the subjects from a distance; a mounted camera that captures the subjects from the egocentric perspective; and a mounted fisheye or omnidirectional camera that captures the subjects from a wider field of view.

2.1 3D pose estimation from an external camera

Convolutional neural networks and large-scale 2D and 3D datasets have recently enabled advances in 3D pose estimation from the images captured by a single camera or multiple cameras [1,2,3]. In monocular 3D pose estimation, two main approaches have emerged: (1) direct regression approaches to 3D joint positions from images [4,5,6,7,8,9,10] and (2) pipeline approaches that decouple the problem into the tasks of 2D joint location estimation and subsequent 3D lift-up [11, 12].

The accuracy and generalization of direct regression approaches are severely limited by the scarcity of 3D pose annotations for in-the-wild images. The two-step decoupled approaches have two advantages: (1) the availability of high-quality off-the-shelf 2D joint location estimators that require only easy-to-harvest 2D annotations [18,19,20,21] and (2) the possibility of training the 3D lift-up step using 3D mocap datasets and their ground truth 2D projections without images.

Martinez et al. [11] showed that even simple architectures solve the lift-up task with a low error rate. We propose a simple pipeline approach customized to 3D ego-pose estimation from a mounted omnidirectional camera.

2.2 3D pose estimation from an egocentric perspective

Self 3D pose estimation from a mobile mounted camera has been in demand for daily activity recognition in recent years [22, 23]. However, most methods detect only parts of body motion (hands or faces) because the limited field of view and proximate setup position make it quite challenging to capture the whole-body pose.

Jiang and Grauman [13] proposed whole-body pose reconstruction from scenes observed by a chest-mounted camera, although the approach lacks accuracy and certainty.

Ahuja et al. [24] proposed a low-cost VR/AR headset composed of a pair of hemispherical mirrors and a smartphone camera. The system obtains 2D human poses from the images reflected in the hemispherical mirrors using OpenPose [25]. Subsequently, the authors lift the 2D pose results to 3D human poses using the viewpoint differences between the two mirrors.

2.3 3D ego-pose estimation from mounted fisheye cameras

The first approach toward direct whole-body pose estimation from an egocentric fisheye camera was proposed by Rhodin et al. [26]. A stereo fisheye camera pair was placed at a distance of approximately 25 cm from the user’s head using telescopic sticks mounted on a helmet. Although the wide field of view captures most of the body, the setup is fairly cumbersome for users.

Lightweight monocular fisheye camera approaches were proposed by Xu et al. [14] and Tome et al. [15, 16]. The camera is placed in front of the user’s forehead using a baseball cap or a head-mounted display. Both works generated large-scale synthetic training datasets to address the shortage of data caused by the unique top-down view and the distortion introduced by the fisheye optics.

Xu et al. [14] proposed a direct regression model (Mo2Cap2) that estimates 3D unit vectors and distances to the 3D joint positions from the camera position. The 3D unit vectors are obtained from the estimated 2D joint locations using the omnidirectional camera calibration toolbox ocamcalib [17].

Tome et al. [15, 16] proposed a pipeline approach using a multibranch encoder-decoder model (xR-EgoPose) to estimate 3D joint positions from 2D joint location heatmaps. The lift-up model can be trained separately using 3D mocap datasets and their ground truth 2D heatmaps without raw image pixels. However, the model requires the 3D pose annotations to be converted into 2D heatmaps for each change in camera optics and setup.

Miura and Sako [27] proposed a single-omnidirectional-camera approach with the camera placed in front of the user’s neck. They validated that the location map method [9] can estimate the 3D ego-pose from images that include not only distortion but also disconnection of view. However, their method handles only the upper-body joints, and the input image must be converted to an equirectangular projection.

Zhang et al. [28] proposed an automatic calibration method that improves the accuracy of 3D joint positions by predicting the intrinsic parameters of omnidirectional cameras. This automatic calibration still depends on the diversity of optical setups in the dataset because the intrinsic parameter estimation is learned from the 2D/3D joint position dependency. Therefore, previous works necessarily trained their models for each set of camera optics and setup.

3 Approach

We propose 3D ego-pose estimation from a single mounted omnidirectional camera composed of back-to-back dual fisheye cameras. We set the omnidirectional camera at a distance of approximately half the shoulder width from the user’s chest using a telescopic stick mounted on the body (see Fig. 1). The camera is lightweight (27 g), small (37.6 mm in diameter), and has a wide field of view (210° for each fisheye lens). Our hardware setup is portable and captures the user’s body under a wide variety of motions.

We generate a large-scale training dataset of synthetic images with ground truth 2D/3D pose annotations in our unique setup. We also collect a small-scale real evaluation dataset.

Because of the camera’s optical properties and proximate setup position, the captured images include both distortion and disconnection of view. Furthermore, the captured images vary drastically with factors such as the optical properties of the lens, the angle of view, and the setup position. This variety of image features makes 3D ego-pose estimation with a convolutional neural network challenging.

We address these challenges with a pipeline approach using a lift-up model customized to our setup. Our model is trainable with ground truth 3D joint positions and unit vectors that are easily generated from existing publicly available 3D mocap datasets. This advantage alleviates the data collection and training burden due to changes in camera optics and setups, although it applies only after the 2D joint location estimation.

3.1 Synthetic training dataset

We present a large-scale training dataset for our unique hardware setup. Acquiring a large amount of annotated 3D pose data is a mammoth task for external camera setups and is even more difficult for the egocentric perspective. Furthermore, acquiring a large number of in-the-wild images with 2D/3D pose annotations for the egocentric omnidirectional perspective is a time-consuming task even when a professional motion capture system is available.

We alleviate these difficulties by rendering a synthetic human body model from a virtual mounted omnidirectional camera perspective. To acquire a large variety of training data, we build a dataset based on the large-scale synthetic human dataset SURREAL [29]. We animate the human model using the SMPL body model [30] with motions sampled from the CMU MoCap dataset. Body textures are randomly chosen from the texture dataset provided by SURREAL.

To generate realistic images, we simulate the camera and background as in a real-world scenario. The virtual camera is placed at a position similar to that of our hardware setup. The camera position is randomly perturbed in each rendering because, in the real world, it moves slightly with body movements and poses. Specifically, the camera is placed at a distance of shoulder width × 0.45 in front of the center of the shoulder line. The camera position is then perturbed in 3D space according to a normal distribution N(σ² = shoulder width × 0.1). We apply the intrinsic camera parameters obtained from the real omnidirectional camera using the omnidirectional camera calibration toolbox ocamcalib [17]. The rendered images are composited with backgrounds randomly chosen from 50 indoor and 54 outdoor images captured by a real omnidirectional camera. A minimal sketch of this camera placement sampling is shown below.
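The following NumPy sketch illustrates the camera placement and perturbation described above, under our own assumptions about the chest-forward direction and about interpreting σ² as the per-axis variance; the function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def sample_camera_position(l_shoulder, r_shoulder, forward, rng=None):
    """Sample a virtual omnidirectional camera position in front of the chest (illustrative).

    l_shoulder, r_shoulder: 3D shoulder joint positions; forward: unit chest-forward direction.
    """
    rng = rng or np.random.default_rng()
    shoulder_width = np.linalg.norm(r_shoulder - l_shoulder)
    shoulder_center = 0.5 * (l_shoulder + r_shoulder)

    # Nominal placement: shoulder width x 0.45 in front of the shoulder-line center.
    nominal = shoulder_center + 0.45 * shoulder_width * forward

    # Per-axis Gaussian perturbation with variance sigma^2 = shoulder width x 0.1,
    # mimicking the slight camera movement caused by body motion in the real world.
    sigma = np.sqrt(shoulder_width * 0.1)
    return nominal + rng.normal(0.0, sigma, size=3)
```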

Our synthetic training dataset contains ground truth annotations of 2D/3D joint positions, which are easily generated using the 3D mocap data and the camera calibration toolbox. We use the following 18 body joints: head, neck, spine, pelvis, shoulders, elbows, wrists, hands, hips, knees, and ankles. The 3D joint positions are expressed in the omnidirectional camera coordinate system of our hardware setup. In total, we collect 151,280 synthetic images with ground truth 2D/3D pose annotations for our large-scale training dataset. Examples are shown in Fig. 2.

Fig. 2

Indoor and outdoor synthetic images with ground truth 2D/3D pose annotations

3.2 Real evaluation dataset

We collect a dataset in real situations for quantitative evaluation. We record 2 people in everyday clothing, indoors and outdoors, performing 5 actions (boxing, dancing, hands up, sitting, and walking).

The ground truth 3D joint positions are recorded using a commercial external RGB-D camera and 3D skeleton tracking software. We simultaneously record the mounted omnidirectional camera images and the 3D joint positions obtained by the RGB-D camera and the tracking software. The 3D joint positions are postprocessed into the omnidirectional camera coordinate system using the fixed omnidirectional camera position. The ground truth 2D joint locations are converted from the 3D joint positions using the camera calibration toolbox.

We use different 3D mocap systems for the synthetic training dataset and the real evaluation dataset; therefore, the 3D skeleton structure and scale differ slightly. To reduce the gap, we normalize the 3D joint positions of both datasets in the following 3 steps while maintaining the skeleton shape: (1) scale the joint positions so that the shoulder width is 1.0; (2) rotate the joint positions so that the shoulder line is horizontal; and (3) rotate the joint positions so that the line between the neck and the pelvis is vertical. A minimal sketch of this normalization is shown below.
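The three steps could be implemented roughly as follows; this is a minimal NumPy sketch under our own axis conventions (x along the shoulder line, y vertical) and joint indexing, not the authors' code.

```python
import numpy as np

def rotation_between(a, b):
    """Rotation matrix mapping unit vector a onto unit vector b (Rodrigues' formula;
    the antiparallel corner case is omitted for brevity)."""
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    v, c = np.cross(a, b), np.dot(a, b)
    vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)

def normalize_skeleton(joints, l_sho, r_sho, neck, pelvis):
    """joints: (J, 3) array of 3D joint positions; the other arguments are joint indices."""
    # (1) Scale so that the shoulder width becomes 1.0.
    joints = joints / np.linalg.norm(joints[r_sho] - joints[l_sho])
    # (2) Rotate so that the shoulder line lies along the horizontal x axis.
    R1 = rotation_between(joints[r_sho] - joints[l_sho], np.array([1.0, 0.0, 0.0]))
    joints = joints @ R1.T
    # (3) Rotate about the x axis (preserving step 2) so the neck-pelvis line becomes vertical.
    d = joints[neck] - joints[pelvis]
    phi = -np.arctan2(d[2], d[1])           # angle of the y-z projection away from the y axis
    c, s = np.cos(phi), np.sin(phi)
    R2 = np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    return joints @ R2.T
```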

3.3 Vector- and distance-based lift-up model

Our lift-up model is a simple feed-forward network that estimates 3D joint positions from 2D joint locations and works with a 2D joint location estimator in the pipeline approach. The model is mainly based on Martinez et al. [11] and uses batch normalization, dropout, ReLU, residual connections [31], and training with a max-norm constraint for high accuracy and generalization. Additionally, our lift-up model is customized to 3D ego-pose estimation from a mounted omnidirectional camera setup as described below.

  • 3D Unit Vectorization: We convert the estimated 2D joint locations on the image plane to 3D unit vectors pointing toward each 3D joint position in the omnidirectional camera coordinate system. The 3D unit vectorization is implemented using the omnidirectional camera calibration toolbox ocamcalib [17] and our camera parameters. This lifts the input representation from the 2D image plane into 3D space, enriching the information available to the model (see the sketch after this list).

  • Vector and Distance Loss Function: Martinez et al. [11] trained their model with the L2 loss function. In the omnidirectional camera coordinate system, a 3D joint position can be decomposed into a 3D unit vector and a distance (the magnitude of the vector). Inspired by this decomposition and by the 3D unit vector input of the lift-up model, we design a new loss function based on the vector and the distance (VD loss):

    $$ \text{VDLoss}({\boldsymbol P}_{j}) = \lambda_{\theta}\,\theta\!\left({\boldsymbol P}_{j}^{GT}, {\boldsymbol P}_{j}\right) + \lambda_{d}\,\mathrm{D}\!\left(\|{\boldsymbol P}_{j}^{GT}\|, \|{\boldsymbol P}_{j}\|\right) $$

    for a 3D joint position Pj, where GT denotes the ground truth. The loss function is composed of a cosine-similarity error and a distance error:

    $$ \theta({\boldsymbol x}, {\boldsymbol y}) = 1 - \frac{{\boldsymbol x} \cdot {\boldsymbol y}}{\|{\boldsymbol x}\|\,\|{\boldsymbol y}\|}, \qquad \mathrm{D}(x, y) = \lvert x - y \rvert. $$

    We set the coefficients λ𝜃 = 1.0 and λd = 0.1 based on a grid search. The VD loss regresses the input 3D unit vectors to 3D joint positions under the constraint of the cosine-similarity error of the vectors.
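A minimal sketch of the two customizations above: the 3D unit vectorization assumes the Scaramuzza-style polynomial model used by ocamcalib (the parameter names here are our own), and the VD loss is written in PyTorch with the cosine-similarity and distance terms defined in the equations. It is an illustrative sketch, not the authors' implementation.

```python
import numpy as np
import torch

def pixel_to_unit_vector(uv, pol, center, affine):
    """Map a pixel (u, v) to a 3D unit ray, assuming a Scaramuzza-style model as in ocamcalib.

    pol:    polynomial coefficients (a0, a1, ..., aN) of the cam2world mapping
    center: distortion center (cx, cy) in pixels
    affine: (c, d, e) affine correction parameters
    """
    c, d, e = affine
    # Undo the affine transform and recenter the pixel coordinates.
    m = np.linalg.inv(np.array([[c, d], [e, 1.0]])) @ (np.asarray(uv) - np.asarray(center))
    rho = np.linalg.norm(m)
    z = np.polyval(pol[::-1], rho)          # cam2world polynomial evaluated at rho
    vec = np.array([m[0], m[1], z])
    return vec / np.linalg.norm(vec)

def vd_loss(pred, gt, lam_theta=1.0, lam_d=0.1):
    """VD loss over predicted and ground truth 3D joint positions of shape (batch, joints, 3)."""
    cos = torch.nn.functional.cosine_similarity(pred, gt, dim=-1)
    theta_err = (1.0 - cos).mean()                                  # cosine-similarity error
    dist_err = (pred.norm(dim=-1) - gt.norm(dim=-1)).abs().mean()   # distance error
    return lam_theta * theta_err + lam_d * dist_err
```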

We describe the pipeline process to estimate the 3D ego-pose from the mounted omnidirectional camera images in Fig. 3. Our lift-up model can flexibly adjust its complexity and size via the number of weights w and the number of residual blocks b (a code sketch of the architecture is given after the Fig. 3 caption). The most important advantage of our model is that it can be trained separately with ground truth 3D joint positions and unit vectors.

Fig. 3

Pipeline process of 3D ego-pose estimation from mounted omnidirectional camera images. Our lift-up model takes as input 3D unit vectors converted from estimated 2D joint locations. The parameter w indicates the number of input/output weights of the fully connected layers except for the first and final layers. The parameter b indicates the number of residual blocks. If b = 0, our lift-up model consists of only an input basic block and the final fully connected layer, without residual blocks
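One possible realization of this architecture in PyTorch, following the Martinez-style basic block (linear layer, batch normalization, ReLU, dropout) parameterized by w and b; details such as the dropout rate are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Linear -> BatchNorm -> ReLU -> Dropout, in the style of Martinez et al. [11]."""
    def __init__(self, in_dim, out_dim, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.BatchNorm1d(out_dim),
            nn.ReLU(inplace=True), nn.Dropout(p_drop))

    def forward(self, x):
        return self.net(x)

class LiftUpModel(nn.Module):
    """Maps 3D unit vectors (J joints x 3) to 3D joint positions (J joints x 3)."""
    def __init__(self, num_joints=18, w=8, b=0):
        super().__init__()
        in_dim = out_dim = num_joints * 3
        self.inp = BasicBlock(in_dim, w)                    # input basic block
        self.res_blocks = nn.ModuleList(
            [nn.Sequential(BasicBlock(w, w), BasicBlock(w, w)) for _ in range(b)])
        self.out = nn.Linear(w, out_dim)                    # final fully connected layer

    def forward(self, unit_vectors):                        # (batch, J, 3)
        x = self.inp(unit_vectors.flatten(1))
        for block in self.res_blocks:
            x = x + block(x)                                # residual connection
        return self.out(x).view(-1, unit_vectors.shape[1], 3)
```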

4 Evaluation

We quantitatively evaluate our lift-up model on the synthetic training dataset and the real evaluation dataset. We sample 18,910 images from the synthetic training dataset because of limited computational resources. We use the mean joint position error (MJPE) and the percentage of correct keypoints (PCK) as evaluation metrics. The error is the Euclidean distance between the estimated and ground truth 3D joint positions. In the evaluation, we rescale the normalized 3D joint positions to real measurements using the shoulder width (in mm) of the evaluation dataset. Both metrics are sketched below.
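A minimal NumPy sketch of the two metrics as we understand them (MJPE as the mean Euclidean joint error in mm, PCK@30 mm as the percentage of joints within 30 mm of the ground truth):

```python
import numpy as np

def mjpe(pred, gt):
    """Mean joint position error in mm; pred and gt have shape (frames, joints, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pck(pred, gt, threshold_mm=30.0):
    """Percentage of joints whose Euclidean error is below the threshold."""
    errors = np.linalg.norm(pred - gt, axis=-1)
    return 100.0 * (errors < threshold_mm).mean()
```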

4.1 Implementation and training details

Our pipeline approach requires a 2D joint location estimator in the first step. The competing direct regression approaches can be internally decoupled into a 2D module and a 3D module. The 2D module estimates 2D joint locations by heatmap regression, and the 3D module extends the 2D module results to 3D joint positions using each model’s own method. In the evaluation, we feed the competing 2D module results into our 3D unit vectorization for a fair comparison.

We first train the competing direct regression models Mo2Cap2 [14] and VNect [9]. The competing models, based on ResNet50 [31], output 32 × 64 pixel heatmaps from their 2D modules for input images with a resolution of 256 × 512 pixels. We pretrain the 2D module on a 2D pose estimation task using the MPII Human Pose dataset [32] so that the models learn good low-level features from real images with normal camera optics. Subsequently, we fine-tune the pretrained models on our synthetic training dataset so that the 2D/3D modules learn the omnidirectional camera optics. Fine-tuning is carried out with a batch size of 32 for 70 epochs using the Adam optimizer with an initial learning rate of 0.05. We decrease the learning rate to 0.001 for the initial 13 residual blocks of the models to preserve the low-level features learned from real images. A sketch of this layer-wise learning rate setup is shown below.
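The layer-wise learning rates could be set up with PyTorch parameter groups roughly as follows, assuming the initial 13 residual blocks correspond to layer1–layer3 of ResNet50 (3 + 4 + 6 = 13 bottleneck blocks); this split is our assumption, and the learning rates are the values reported above.

```python
import torch
import torchvision

# ResNet50 backbone as used by the competing 2D modules (task-specific heads omitted here).
backbone = torchvision.models.resnet50(weights=None)
early = torch.nn.ModuleList([backbone.conv1, backbone.bn1,
                             backbone.layer1, backbone.layer2, backbone.layer3])
late = torch.nn.ModuleList([backbone.layer4])

optimizer = torch.optim.Adam([
    {"params": early.parameters(), "lr": 0.001},  # lower rate preserves low-level features
    {"params": late.parameters(), "lr": 0.05},    # initial learning rate reported in the text
])
```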

Our lift-up model is trained with 2D joint locations estimated by the competing 2D modules on the synthetic training dataset. Training is carried out with a batch size of 32 for 70 epochs using the Adam optimizer with an initial learning rate of 0.001. We set the model architecture to 8 weights (w = 8) and 0 residual blocks (b = 0) in this evaluation based on the ablation study results. Our lift-up model thus has only 2 fully connected layers and no residual blocks.

4.2 Comparison with previous works

We quantitatively compare our lift-up model to related previous works on the real evaluation dataset. We compare against Mo2Cap2 [14] and xR-EgoPose [15, 16], which estimate the 3D ego-pose from a single mounted fisheye camera. As the xR-EgoPose model, we use a dual-branch encoder-decoder with 3D pose and heatmap branches in this comparison. Additionally, we compare against VNect [9], which was validated for a mounted omnidirectional camera setup by Miura and Sako [27]. The decoupled two-step approaches (xR-EgoPose and ours) are trained and evaluated on the 2D module results of both Mo2Cap2 and VNect.

We present the MJPE (mm) and PCK@30 mm results in Table 1. Our lift-up model performs worse than xR-EgoPose in MJPE (mm) even though it obtains the best accuracy in PCK@30 mm. These results indicate that our model achieves high accuracy when the 2D joint location estimator works well; however, deterioration of the 2D joint location estimation has a larger negative impact on our lift-up model because it is less robust than xR-EgoPose. Additionally, our lift-up model obtains higher accuracy with the 2D modules of VNect and Mo2Cap2 than the original full models.

Table 1 MJPE (mm) results on a real evaluation dataset for comparison with previous works. PCK@30 mm is also indicated in brackets

We report the model parameter sizes and the estimated processing time on a CPU (Intel Xeon @ 2.20 GHz) in Table 2. Because VNect and Mo2Cap2 contain the whole process of estimating 3D joint positions from the input image, we also report their 2D-module-only results, which exclude the 3D module. Our lift-up model shows a smaller parameter size and faster execution time even when combined with the 2D modules.

Table 2 Model parameter size and estimated time on CPU (Intel Xeon @ 2.20 GHz) for previous works and our model. We indicate 2D module results as well in brackets for VNect and Mo2Cap2

We present examples of 3D ego-pose estimation results obtained by our lift-up model with VNect’s 2D module in Fig. 4. The skeleton structures of the estimation and the ground truth differ slightly because the 3D joint positions of the synthetic training dataset and the real evaluation dataset were acquired with different mocap systems.

Fig. 4

Examples of 3D ego-pose estimation results obtained by our lift-up model with VNect’s 2D module. (left column) Input mounted omnidirectional camera images. (center and right column) Ground truth and estimated 3D joint positions from different angles

5 Ablation study

To further analyze our approach, we evaluate additional aspects of our lift-up model: the accuracy and generalization with respect to the model parameters (w and b), the effectiveness of the 3D unit vectorization and the VD loss function, and the possibility of training with ground truth 3D pose annotations. Additionally, we evaluate our model on the Mo2Cap2 dataset [14].

5.1 Model parameters

We report the MJPE (mm) results and standard deviations of our lift-up model in Fig. 5. The model parameters are varied as follows: the number of weights w from 4 to 16 and the number of residual blocks b from 0 to 3. We note that VNect’s 2D module provides better 2D joint locations than Mo2Cap2’s 2D module.

Fig. 5

MJPE (mm) results and standard deviations of our lift-up model based on (left) VNect’s 2D module and (right) Mo2Cap2’s 2D module. The model parameters are varied as follows: the number of weights w from 4 to 16 and the number of residual blocks b from 0 to 3. Markers indicate the number of weights w to aid visibility of the standard deviations

A model with more weights estimates 3D joint positions with higher accuracy, although the increased number of weights degrades generalization to the poorer 2D joint locations estimated by Mo2Cap2’s 2D module because of overfitting to the training dataset. Regarding the residual blocks, increasing their number slightly improves accuracy and generalization for an appropriately sized model.

For our training and evaluation datasets, we set the model parameters to 8 weights (w = 8) and 0 residual blocks (b = 0), considering accuracy, generalization, and model size.

5.2 3D unit vectorization and VD loss function

We present the MJPE (mm) and PCK@30 mm results of the Martinez et al. [11] model and our models in Table 3 to analyze the effectiveness of the 3D unit vectorization and the VD loss function.

Table 3 MJPE (mm) results of Martinez et al. and our lift-up models. PCK@30 mm is also indicated in brackets

The Martinez et al. model takes only the estimated 2D joint locations as input and is trained with the L2 loss function. In our pipeline approach, we convert the 2D joint locations to 3D unit vectors as the input of our lift-up model. The 3D unit vectorization provides better accuracy for both 2D module baselines. A comparison between the L2 loss and the VD loss on our model shows that the VD loss also provides better estimation results.

5.3 Ground truth training

We present the MJPE (mm) and PCK@30 mm results in Table 4. Here, our lift-up model is trained with the ground truth 3D joint positions and unit vectors of the synthetic training dataset. Additionally, we train our model on a held-out synthetic dataset that is produced by the same synthetic generation pipeline but is not used to train the 2D joint location estimator. The ground truth 3D joint positions and unit vectors are easily generated from the 3D pose annotations.

Table 4 MJPE (mm) results of our lift-up models trained with estimated 2D joint locations, ground truth 3D pose annotations, and ground truth 3D pose annotations of exclusive synthetic dataset. PCK@30 mm is also indicated in brackets

Training with the 2D estimation results performs better in general because the model learns robustness to the 2D joint location estimator’s errors. However, our lift-up model shows comparable accuracy even when trained on ground truth data that were not used in the 2D module training.
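Generating such ground truth training pairs from 3D pose annotations only requires splitting each camera-frame joint position into its direction and magnitude; a minimal sketch, assuming the positions are already expressed in the omnidirectional camera coordinate system:

```python
import numpy as np

def make_training_pair(joints_cam):
    """joints_cam: (J, 3) ground truth joint positions in the camera coordinate system.

    Returns the 3D unit vectors (model input) and the positions themselves (target),
    whose magnitudes are the camera-to-joint distances.
    """
    distances = np.linalg.norm(joints_cam, axis=-1, keepdims=True)
    unit_vectors = joints_cam / distances
    return unit_vectors, joints_cam
```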

5.4 Evaluation on the Mo2Cap2 Dataset

We compare our lift-up model with previous works [8, 9, 14,15,16] on the Mo2Cap2 dataset. We cannot obtain estimated 3D unit vectors for the training dataset because the Mo2Cap2 dataset does not provide the fisheye camera intrinsic parameters. Therefore, we use the ground truth 3D joint positions and unit vectors of the training dataset.

We train our model on the ground truth training data and set the model parameters to 2 residual blocks (b = 2) and 4096 weights (w = 4096). The model parameter size is 67.5 M, and the execution time is 16.592 ms/frame on a CPU. Training is carried out with λ𝜃 = 1.0, λd = 0.00001, a batch size of 1024, and 5K epochs.

We show the MJPE (mm) results in Table 5. Our model is worse than Mo2Cap2 in the overall average, although it performs better than Mo2Cap2 in some actions. We present examples of estimation results in Fig. 6.

Table 5 MJPE (mm) results on the Mo2Cap2 dataset for comparison with previous works
Fig. 6

Examples of estimation results on the Mo2Cap2 dataset: (top) input image, (middle) ground truth, and (bottom) estimation result

6 Discussion

6.1 Dataset restriction

Our synthetic and real datasets have two restrictions: (1) the fixed omnidirectional camera position and (2) the skeleton normalization during recording. These restrictions make 3D ego-pose estimation an easier problem. However, for wearable devices in daily use, the setup position is usually determined by the purpose, for example, a head-mounted display, smart glasses, or a microphone headset. Regarding skeleton normalization, the human pose is more important than the real-scale 3D joint positions, which can be recovered by postprocessing with user data. For these reasons, our dataset restrictions are acceptable in practical application scenarios.

6.2 Model performance and size

Our lift-up model can be flexibly adjusted to the performance of the 2D joint location estimator via its model parameters, the numbers of weights and residual blocks. Specifically, increasing the number of weights and residual blocks in accordance with the quality of the 2D estimator improves the accuracy of 3D joint position estimation.

Another advantage is that our model size does not grow with the input image resolution. Enlarging the input images is the most direct approach to improving the accuracy of 2D/3D joint position estimation, but it increases the number of parameters of convolutional neural networks such as VNect [9], Mo2Cap2 [14], and xR-EgoPose [15, 16]. Our model avoids this parameter growth when the input images are enlarged, although this advantage applies only after the 2D joint location estimation.

6.3 Separately training the 2D/3D module

Our lift-up model is trainable with ground truth 3D joint positions and unit vectors. This advantage enables the separate training of the 2D joint location estimator and our model.

xR-EgoPose is also separately trainable with ground truth 3D mocap datasets and their 2D projections. However, the 2D joint location heatmaps must be regenerated according to the omnidirectional camera properties and setup. This 2D-3D dependency requires data collection and model training for each camera property and setup.

Our pipeline approach separates the 2D-3D dependency of 3D human pose estimation in terms of both intrinsic parameters and the data collection pipeline in the field of omnidirectional cameras. Specifically, the 3D unit vectorization module confines the impact of the camera optics to the module parameters, which are determined by the intrinsic parameters of the omnidirectional camera. Therefore, our model can be trained separately with ground truth 3D joint positions and unit vectors that are easily generated from existing publicly available 3D mocap datasets.

7 Conclusion

We proposed 3D ego-pose estimation from a single mounted omnidirectional camera that captures the entire circumference with back-to-back dual fisheye cameras. The 360° field of view captures the user’s body under a wide variety of motions and provides flexibility in the camera setup position.

We built a new large-scale synthetic training dataset and a real evaluation dataset using our unique omnidirectional camera setup. The dataset contains 151K synthetic and 5K real images with ground truth 2D/3D pose annotations and is publicly available.

We proposed a pipeline approach using a lift-up model to estimate 3D joint positions from 2D joint locations. The model works with a 2D joint location estimator and is customized to the mounted omnidirectional camera setup through 3D unit vectorization of the 2D joint locations and training with the vector- and distance-based loss function (VD loss). The customized model runs in real time yet shows accuracy comparable to that of previous methods on our dataset.

The most important advantage of our pipeline approach is that it separates the 2D-3D dependency of 3D human pose estimation in terms of both intrinsic parameters and the data collection pipeline in the field of omnidirectional cameras. Our lift-up model does not need to learn input image features that vary drastically with factors such as the optical properties of the lens, the angle of view, and the setup position, because the 3D unit vectorization module confines the impact of the camera optics to the module parameters determined by the intrinsic parameters of the omnidirectional camera. Therefore, our model can be trained separately with ground truth 3D joint positions and unit vectors that are easily generated from existing publicly available 3D mocap datasets. This advantage alleviates the data collection and training burden arising from changes in camera optics and setups, although it applies only after the 2D joint location estimation.