EfficientPose: Scalable single-person pose estimation

Human pose estimation facilitates markerless movement analysis in sports, as well as in clinical applications. Still, state-of-the-art models for human pose estimation generally do not meet the requirements for real-life deployment. The main reason for this is that, as the field progresses, the approaches become increasingly expensive, with high computational demands. To cope with the challenges caused by this trend, we propose a convolutional neural network architecture that benefits from the recently proposed EfficientNets to deliver scalable single-person pose estimation. To this end, we introduce EfficientPose, a family of models harnessing an effective multi-scale feature extractor, computation-efficient detection blocks utilizing mobile inverted bottleneck convolutions, and upscaling that improves the precision of pose configurations. EfficientPose enables real-world deployment on edge devices through a 500K-parameter model consuming less than one GFLOP. The results from our experiments, using the challenging MPII single-person benchmark, show that the proposed EfficientPose models substantially outperform the widely used OpenPose model in terms of accuracy, while at the same time being up to 15 times smaller and 20 times more computationally efficient than their counterpart.


Introduction
Human pose estimation (HPE) refers to the computer vision task of localizing human skeletal keypoints from images or video frames. HPE has many real-world applications, ranging from outdoor activity recognition and computer animation to clinical assessments of motor repertoire and skill practice among professional athletes. The proliferation of deep convolutional neural networks (ConvNets) has advanced HPE and further widened its application areas. Although ConvNet-based HPE, with its increasingly complex network structures combined with transfer learning, is a very challenging task, the availability of high-performing ImageNet [9] backbones, together with large tailor-made datasets such as MPII for 2D pose estimation [2], has facilitated the development of new and improved methods to address these challenges.
An increasing trend in computer vision drives towards more efficient models [11,16,29,38]. Recently, EfficientNet [39] was released as a scalable ConvNet architecture, setting a benchmark record on ImageNet with a more computation-efficient architecture. However, within HPE there is a lack of accurate and computation-efficient architectures. State-of-the-art architectures in HPE are computationally expensive and highly complex, thus making the networks hard to replicate, cumbersome to optimize, and impractical to embed into real-world applications. The OpenPose network [6] is one of the most applied HPE methods in real-world applications [8,13,23,25], and is the first open-source real-time system for HPE. However, the OpenPose architecture is highly computationally inefficient, spending 160 billion floating-point operations (160 GFLOPs) per inference. Moreover, the level of detail in OpenPose keypoint estimates is limited by its low-resolution outputs. This makes OpenPose less suitable for precision-demanding applications, such as elite sports and medical assessments, which all depend on a high degree of precision in the assessment of movement kinematics. Despite these challenges, OpenPose remains a widely used HPE network for markerless motion capture upon which critical decisions are based [3,8,47].
In this paper, we call attention to the lack of publicly available methods for HPE that are both computationally efficient and provide high-precision estimates. To this end, we harness recent progress in ConvNets to propose a novel approach called EfficientPose, which is a family of scalable ConvNets for single-person pose estimation from 2D images. To show how it stands against existing available methods, we compare the precision and computation efficiency of EfficientPose with OpenPose on single-person HPE. The proposed models aim to elicit improved precision levels, while bridging the gap in availability of high-precision HPE networks.
The main contributions of this paper are the following:
- We propose a novel method, called EfficientPose, that can overcome the shortcomings of the popular OpenPose network on single-person HPE with an improved level of precision, rapid convergence during optimization, friendly model size, and low computational cost.
- With EfficientPose, we suggest an approach providing scalable models that can suit various demands, enabling a trade-off between accuracy and efficiency across diverse application constraints and computational budgets.
- We propose a new way to incorporate mobile ConvNet components, which can address the need for computation-efficient architectures for HPE, thus facilitating real-time HPE on the edge.
- We perform an extensive comparative study to evaluate our approach. Our experimental results show that the proposed method achieves significantly higher efficiency and accuracy in comparison to the baseline method, OpenPose. In addition, compared to existing state-of-the-art methods, it achieves competitive results with a much smaller number of parameters.
The remainder of this paper is organized as follows. Section 2 describes the architecture of OpenPose and highlights research from which it can be improved. Based on this, Section 3 presents our proposed ConvNet-based approach, EfficientPose. Section 4 describes our experiments and presents the results from comparing EfficientPose with OpenPose and other existing approaches. Section 5 discusses our findings and suggests potential future studies. Finally, Section 6 summarizes and concludes the paper.
To promote research on beneficial applications within movement science, we will make the EfficientPose models available at https://github.com/daniegr/EfficientPose.

Related work
The proliferation of ConvNets for HPE following the success of DeepPose [45] has set the path for accurate HPE. With OpenPose, Cao et al. [6] made HPE available to the public. As depicted in Figure 1, OpenPose comprises a multi-stage architecture performing a series of detection passes. Provided an input image of 368×368 pixels, OpenPose utilizes an ImageNet-pretrained VGG-19 backbone [32] to extract basic features (step 1 in Figure 1). The features are supplied to a DenseNet-inspired detection block (step 2) arranged as five dense blocks [18], each containing three 3×3 convolutions with PReLU activations [14]. The detection blocks are stacked in sequence. First, four passes (steps 3a-d in Figure 1) of part affinity fields [7] map associations between body keypoints. Subsequently, two detection passes (steps 3e and 3f) predict keypoint heatmaps [44] to obtain refined keypoint coordinate estimates. In terms of level of detail in keypoint coordinates, OpenPose is restricted by its output resolution of 46 × 46 pixels.
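As context for the part affinity fields mentioned above, the ground-truth field for a single limb can be sketched in a few lines of numpy. This is a simplified illustration of the scheme of Cao et al. [7], not their exact implementation; the function name, coordinate convention, and the σ width threshold are our assumptions.

```python
import numpy as np

def paf_ground_truth(shape, p1, p2, sigma=1.0):
    """Build a ground-truth part affinity field for one limb.

    Pixels within `sigma` of the segment p1 -> p2 receive the unit
    vector pointing from p1 to p2; all other pixels stay zero.
    `shape` is (H, W); p1 and p2 are (x, y) keypoint coordinates.
    """
    h, w = shape
    paf = np.zeros((h, w, 2))
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    limb = p2 - p1
    norm = np.linalg.norm(limb)
    if norm == 0:
        return paf
    v = limb / norm  # unit vector along the limb
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    dx, dy = xs - p1[0], ys - p1[1]
    along = dx * v[0] + dy * v[1]          # distance along the limb
    perp = np.abs(dx * v[1] - dy * v[0])   # distance perpendicular to it
    mask = (along >= 0) & (along <= norm) & (perp <= sigma)
    paf[mask] = v
    return paf
```

At inference time, candidate keypoint pairs are scored by integrating the predicted field along the line connecting them, which is what allows keypoints to be grouped into persons.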
The OpenPose architecture can be improved upon by recent advancements in ConvNets. First, automated network architecture search has found backbones [39,40,52] that are more precise and efficient in image classification than VGG and ResNets [15,32]. Tan et al. [39] proposed compound model scaling to balance image resolution, width (number of network channels), and depth (number of network layers). This resulted in the scalable EfficientNets [39], flexible in model size and precision level. For each model variant φ, from EfficientNet-B0 to EfficientNet-B7 (φ ∈ {0, 1, …, 7}), the total number of FLOPs increases by a factor of 2 (Equation 1). Coefficients for depth (α), width (β), and resolution (γ) are given in Equation 2.
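For reference, the compound scaling rule of Tan et al. [39] couples the three dimensions through the single coefficient φ. The following is a transcription of the published EfficientNet formulation, given here because it is what the Equation 1 and 2 references rely on:

```latex
\text{depth: } d = \alpha^{\phi}, \qquad
\text{width: } w = \beta^{\phi}, \qquad
\text{resolution: } r = \gamma^{\phi}, \qquad
\text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,\;
\alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
```

With the grid-searched values α = 1.2, β = 1.1, and γ = 1.15, the total FLOPs grow by approximately 2^φ, since FLOPs scale linearly with depth but quadratically with both width and resolution.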
Fig. 1 OpenPose architecture utilizing 1) a VGG-19 feature extractor, and 2) detection blocks performing 4+2 passes of estimating part affinity fields (3a-d) and confidence maps (3e and 3f)

Second, parallel multi-scale feature extraction has raised the precision levels in HPE [19,24,36,48], emphasizing both high spatial resolution and low-scale semantics. However, existing multi-scale approaches in HPE are computationally expensive, reflected by 10-72 GFLOPs, and of large size, typically 16-56 million parameters [24,27,35,36,41,48,51]. To cope with this, we propose cross-resolution features to integrate features from multiple abstraction levels with low overhead in network size and computation efficiency. Third, mobile inverted bottleneck convolution (MBConv) [29] with built-in squeeze-and-excitation (SE) [17] and Swish activation [28], as integrated in EfficientNets, proves more accurate in image classification [39,40] than regular convolutions [15,18,37], while reducing FLOPs by up to 19× [39]. The efficiency of MBConv modules stems from the depthwise convolutions operating in a channel-wise manner [31]. This reduces the computational cost by a factor proportional to the number of channels [40]. The extensive use of regular 3 × 3 convolutions with up to 384 input channels in the detection blocks reflects the potential of MBConvs in OpenPose. Further, SE selectively emphasizes discriminative features [17], which may reduce the required number of convolutions and detection passes. Utilizing MBConv with SE may thus have the potential to decrease the number of dense blocks in OpenPose. Fourth, transposed convolutions with a bilinear kernel [22] upscale low-resolution feature maps, enabling a higher level of detail in output confidence maps.
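The FLOP savings of depthwise-separable convolutions can be made concrete with a back-of-the-envelope count. This is a sketch: the 46 × 46 × 384 configuration mirrors the OpenPose detection blocks mentioned above, and multiply-accumulates are counted as single operations.

```python
def conv_flops(h, w, k, c_in, c_out):
    """Multiply-accumulates of a regular k x k convolution on an h x w map."""
    return h * w * k * k * c_in * c_out

def separable_flops(h, w, k, c_in, c_out):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise projection."""
    return h * w * k * k * c_in + h * w * c_in * c_out

# A detection-block layer at OpenPose's working resolution:
# 46 x 46 feature maps, 3 x 3 kernels, 384 input and output channels.
regular = conv_flops(46, 46, 3, 384, 384)
separable = separable_flops(46, 46, 3, 384, 384)
print(f"regular / separable = {regular / separable:.1f}x")  # ~8.8x
```

The depthwise stage alone is c_out times cheaper than the regular convolution, which is the "proportional to the number of channels" saving; once the 1 × 1 pointwise projection is included, the overall saving tends towards k² for wide layers.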
By building upon the work of Tan et al. [39], we present a pool of scalable models for single-person HPE, overcoming the shortcomings of the commonly adopted OpenPose architecture. This enables trading off between accuracy and efficiency across computational budgets in real-world applications. In contrast to OpenPose, this allows for ConvNets that are small and computationally efficient enough to run on edge devices with little memory and low processing power. We address the gap in openly available scalable and computation-efficient ConvNets for single-person HPE.

Architecture
The EfficientPose network (Figure 2) introduces several modifications of the OpenPose architecture shown in Figure 1, including 1) high- and low-resolution input images, 2) scalable EfficientNet backbones, 3) cross-resolution features, 4 and 5) scalable Mobile DenseNet detection blocks in fewer detection passes, and 6) bilinear upscaling. For a more thorough component analysis of EfficientPose, see Appendix A.
The input of the network comprises high- and low-resolution images (1a and 1b in Figure 2). The low-resolution image is downsampled to half the pixel height and width of the high-resolution image by an initial average pooling layer.
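This downsampling step amounts to a 2 × 2 average pooling with stride 2, which can be sketched in numpy as follows (a minimal illustration assuming even spatial dimensions; the helper name is ours):

```python
import numpy as np

def average_pool_2x2(image):
    """Halve height and width by averaging non-overlapping 2x2 windows."""
    h, w, c = image.shape
    return image.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

# A 368 x 368 high-resolution input yields a 184 x 184 low-resolution view.
high_res = np.random.rand(368, 368, 3)
low_res = average_pool_2x2(high_res)  # shape (184, 184, 3)
```

Because pooling is parameter-free, the two views cost nothing extra to produce beyond the pooling operation itself.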
The feature extractor of EfficientPose is composed of the initial blocks of EfficientNets [39] pretrained on ImageNet (steps 2a and 2b in Figure 2). High-level semantic information is obtained from the high-resolution image using the initial three blocks of a high-scale EfficientNet with φ ∈ [2, 7] (Equation 1), outputting C feature maps (2a). Low-level local information is extracted from the low-resolution image by the first two blocks of a lower-scale EfficientNet backbone (2b) in the range φ ∈ [0, 3]. Table 1 displays the composition of the EfficientNet backbones, from low-scale B0 to high-scale B7. The first block of EfficientNets utilizes the MBConvs described in Figure 3a and 3b, while the second and third blocks comprise the MBConv layers in Figure 3c and 3d.

Fig. 2 Proposed architecture comprising 1a) high-resolution and 1b) low-resolution inputs, 2a) high-level and 2b) low-level EfficientNet backbones combined into 3) cross-resolution features, 4) Mobile DenseNet detection blocks, 1+2 passes for estimation of part affinity fields (5a) and confidence maps (5b and 5c), and 6) bilinear upscaling
The features generated by the low-level and high-level EfficientNet backbones are concatenated to yield cross-resolution features (step 3 in Figure 2). This enables the EfficientPose architecture to selectively emphasize both important local factors in the image of interest and the overall structures guiding high-quality pose estimation. In this way, we enable simultaneous handling of features at multiple abstraction levels.
From the extracted features, the desired keypoints are localized through an iterative detection process exploiting intermediate supervision. Each detection pass comprises a detection block and a single 1 × 1 convolution for output prediction. The detection blocks across all detection passes share the same basic architecture, comprising Mobile DenseNets (see step 4 in Figure 2). Data from the Mobile DenseNets are forwarded to subsequent layers of the detection block using residual connections. The Mobile DenseNet is inspired by DenseNets [18], supporting reuse of features and avoiding redundant layers, and by MBConv with SE, enabling a low memory footprint. In our adaptation of the MBConv operation (E-MBConv6(K × K, B, S) in Figure 3e), we consistently utilize the highest performing combination from [38], meaning a kernel size (K × K) of 5 × 5 and an expansion ratio of 6. We also avoid downsampling (i.e. S = 1) and scale the width of the Mobile DenseNets by outputting a number of channels relative to the high-level backbone (B = C). We modify the original MBConv6 operation by incorporating E-swish as activation function with a β value of 1.25 [1]. This has a tendency to accelerate progression during training compared to the regular Swish activation [28]. We also adjust the first 1 × 1 convolution to generate a number of feature maps relative to the output feature maps B rather than the input channels M. This reduces memory consumption and computational latency, as B ≤ M, with C ≤ M ≤ 3C. With each Mobile DenseNet consisting of three consecutive E-MBConv6 operations, the module outputs 3C feature maps.
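The E-swish activation [1] used inside E-MBConv6 is, in isolation, a one-line function (numpy sketch; β = 1.25 as stated above):

```python
import numpy as np

def e_swish(x, beta=1.25):
    """E-swish activation: beta * x * sigmoid(x).

    beta = 1.0 recovers the plain Swish activation; beta = 1.25 is
    the value adopted for the Mobile DenseNet detection blocks."""
    return beta * x / (1.0 + np.exp(-x))
```

Like Swish, the function is zero at the origin and smooth everywhere; for large positive inputs it approaches βx, so β > 1 amplifies the forward signal and the gradients slightly, which is the reported source of faster training progression.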
EfficientPose performs detection in two rounds (steps 5a-c in Figure 2). First, the overall pose of the person is anticipated through a single pass of skeleton estimation (5a). This aims to facilitate the detection of feasible poses and to avoid confusion in case several persons are present in an image. Skeleton estimation is performed utilizing part affinity fields as proposed in [7]. Following skeleton estimation, two detection passes are performed to estimate heatmaps for the keypoints of interest. The former of these acts as a coarse detector (5b), while the latter (5c) refines localization to yield more accurate outputs.

Table 1 The architecture of the initial three blocks of relevant EfficientNet backbones. For Conv(K × K, N, S), K × K denotes filter size, N is the number of output feature maps, and S is the stride. BN denotes batch normalization. I defines the input size, corresponding with the image resolution on ImageNet, while α^φ refers to the depth factor as determined by Equation 1

In OpenPose, the heatmaps of the final detection pass are constrained to a low spatial resolution incapable of achieving the amount of detail inherent in the high-resolution input [6]. To improve upon this limitation of OpenPose, a series of three transposed convolutions performing bilinear upsampling is injected (step 6 in Figure 2). Thus, we project the low-resolution output onto a space of higher resolution, allowing an increased level of detail. To achieve the proper level of interpolation while operating efficiently, each transposed convolution increases the map size by a factor of 2, using a stride of 2 with a 4 × 4 kernel.
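The fixed weights for such a bilinear transposed convolution can be generated with the standard FCN-style construction (a numpy sketch under that assumption; the helper name is ours):

```python
import numpy as np

def bilinear_kernel(size):
    """2-D bilinear interpolation kernel of shape (size, size)."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    return ((1 - abs(og[0] - center) / factor) *
            (1 - abs(og[1] - center) / factor))

kernel = bilinear_kernel(4)
# One stride-2 transposed convolution with this kernel doubles the
# spatial resolution; three in sequence give an 8x overall upscaling.
```

For size = 4 the 1-D weights are (0.25, 0.75, 0.75, 0.25), so overlapping stride-2 outputs interpolate linearly between neighboring input pixels, which is exactly bilinear upsampling expressed as a convolution.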

Variants
In accordance with the original EfficientNet [39], we scale EfficientPose with respect to the three main dimensions (resolution, width, and depth), utilizing the coefficients of Equation 2. Table 2 displays the five variants obtained from scaling the EfficientPose architecture. The resolution, defining the spatial dimensions of the input image (H × W), is scaled utilizing the high- and low-level EfficientNet backbones (Table 1) that best match the resolutions of the high- and low-resolution inputs. The network width refers to the number of feature maps generated by each E-MBConv6. As described in Section 3.1, width scaling is achieved using the same width as the high-level backbone (i.e. C). To achieve proper scaling also in the depth dimension, variation is achieved in the number of Mobile DenseNets (MD(C) in Table 2) in the detection blocks. This ensures similar relative sizes of receptive fields across the models and spatial resolutions. For each model variant, we select the number (D) of Mobile DenseNets that best approximates the original depth factor α^φ in the high-level EfficientNet backbone (Table 1). More specifically, the number of Mobile DenseNets is determined by Equation 3, rounding to the closest integer. Besides EfficientPose I to IV, the single-resolution model EfficientPose RT is formed to match the scale of the smallest EfficientNet model, providing HPE in extremely low-latency applications.

Summary of proposed framework
The EfficientPose framework comprises a family of five ConvNets (EfficientPose RT-IV) that benefits from compound scaling [39] and advances in computationally efficient ConvNets for image recognition to construct a scalable network architecture performing single-person HPE across different computational constraints. More specifically, EfficientPose utilizes both high- and low-resolution images to provide two separate viewpoints that are processed independently through high- and low-level backbones, respectively. The resulting features are concatenated to yield cross-resolution features, enabling selective emphasis on global and local image information. The detection stage utilizes a scalable mobile detection block to perform detection in three passes. The first pass estimates person skeletons through part affinity fields [7] to yield feasible pose configurations. The second and third passes estimate keypoint locations with progressive improvement in precision. Finally, the low-resolution prediction of the third pass is upscaled through bilinear interpolation to yield a further improvement in precision.

Experimental setup
We evaluate EfficientPose and compare it with OpenPose on the single-person MPII dataset [2], containing images of humans in a wide range of different activities and situations. All models are optimized on MPII using SGD with cyclical learning rates (Appendix B). The training and validation portion of the dataset comprises 29K images, and by adopting a standard random split we obtain 26K and 3K instances for training and validation, respectively. We augment images during training using random horizontal flipping, scaling (0.75-1.25), and rotation (+/- 45 degrees). The evaluation of model accuracy is performed using the PCKh@τ metric. PCKh@τ is defined as the fraction of predictions residing within a distance τ·l from the ground truth location, with l being 60% of the diagonal of the head bounding box and τ the accepted percentage of misjudgment relative to l. PCKh@50 is the standard performance metric for MPII, but we also include the stricter PCKh@10 metric for assessing the models' ability to yield highly precise keypoint estimates. In a common fashion, final model predictions are obtained by applying a multi-scale testing procedure [36,41,48]. Due to the restriction on the number of attempts for official evaluation on MPII, test metrics are only supplied for the OpenPose baseline and for the most efficient and most accurate models, EfficientPose RT and EfficientPose IV, respectively. To measure model efficiency, both FLOPs and the number of parameters are supplied.

Results
Table 3 shows the results of our experiments when evaluating OpenPose and EfficientPose on the MPII validation dataset. This reveals that EfficientPose consistently outperforms OpenPose with regards to efficiency, with 4-56× smaller model size and a 2.2-184× reduction in FLOPs. All model variants elicit improved high-precision localization, with a 0.8-12.9% gain in PCKh@10 compared to OpenPose. Furthermore, the high-end models EfficientPose II-IV yield a 0.6-2.2% improvement in PCKh@50.
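The PCKh@τ metric described above can be sketched in a few lines of numpy (the array-shape conventions are our assumptions):

```python
import numpy as np

def pckh(pred, gt, head_diag, tau=0.5):
    """Mean PCKh@tau over all keypoints.

    A prediction counts as correct when its Euclidean distance to the
    ground truth is at most tau * l, with l = 0.6 * head_diag.

    pred, gt:  (N, K, 2) keypoint coordinates
    head_diag: (N,) head bounding-box diagonals
    """
    l = 0.6 * np.asarray(head_diag)            # reference length per instance
    dist = np.linalg.norm(pred - gt, axis=2)   # (N, K) Euclidean distances
    return float((dist <= tau * l[:, None]).mean())
```

PCKh@50 and PCKh@10 correspond to tau = 0.5 and tau = 0.1, respectively; the tighter threshold is what rewards the high-resolution outputs produced by the upscaling stage.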
As Table 4 depicts, EfficientPose IV achieves state-of-the-art results (mean PCKh@50 of 91.2) on the official MPII test dataset for models of less than 10 million parameters. Compared to OpenPose, EfficientPose also displays rapid convergence during training. We optimized both approaches on similar input resolutions, which defaults to 368 × 368 for OpenPose, corresponding to EfficientPose II. The training graph demonstrates that EfficientPose converges early, while OpenPose requires up to 400 epochs for proper convergence (Figure 4). As Table 5 points out, OpenPose benefits from prolonged training, with a 2.6% improvement in PCKh@50 during the final 200 epochs, while EfficientPose II has a minor gain of 0.4%.

Discussion
Facilitated by cross-resolution features and upscaling of the output (see Appendix A), EfficientPose elevates the precision level compared to OpenPose [6] by a 57% relative improvement in PCKh@10 on single-person MPII (Table 3). This reflects the inherent limitation of the OpenPose architecture when it comes to performing HPE in a single-person domain. EfficientPose more consistently supplies movement information of high spatial resolution. Hence, EfficientPose proves more promising in addressing precision-demanding single-person HPE applications like medical assessments and elite sports.
The precision of HPE methods is a key factor for analyses of movement kinematics, like segment positions and joint angles, for the assessment of sport performance in athletes or motor disabilities in patients. In contrast, for some applications (e.g. exercise games and baby monitors), we are more interested in the latency of the system and its ability to respond quickly. In these scenarios, the degree of correctness in keypoint coordinates may be less crucial. The scalability of EfficientPose enables flexibility in these various situations and across different types of hardware, whereas OpenPose suffers from its large model size and number of FLOPs.
The use of MBConvs in HPE is, to our knowledge, an unexplored path. Nevertheless, EfficientPose approached state-of-the-art performance on the single-person MPII benchmark despite a large reduction in the number of parameters (Table 4). This suggests that the parameter-efficient MBConvs provide value in HPE as in other computer vision tasks such as image classification and object detection. Thus, MBConv is a promising component for HPE networks, and it will be interesting to assess its effect when combined with other novel HPE architectures such as Hourglass and HRNet [24,36]. The use of EfficientNet as backbone, and the proposed cross-resolution feature extractor combining several EfficientNets for improved handling of basic features, are also interesting avenues to explore further. From the present study, we conjecture that EfficientNets could replace commonly used backbones for HPE, such as VGG and ResNets, reducing the computational overheads associated with these approaches [15,32]. Additionally, a cross-resolution feature extractor could serve precision-demanding applications by improving the performance on PCKh@10 (Table 6). We also observed that EfficientPose benefits from compound model scaling across resolution, width, and depth. This benefit is reflected by incremental improvements in PCKh@50 and PCKh@10 from EfficientPose RT through EfficientPose IV (Table 3). By utilizing this benefit to further examine scalable ConvNets for HPE, we can obtain insight into the appropriate sizes of HPE models (i.e. number of parameters), the required number of FLOPs, and the obtainable precision levels.

Table 3 Performance of EfficientPose compared to OpenPose on the MPII validation dataset, as evaluated by efficiency (number of parameters and FLOPs, and relative reduction in parameters and FLOPs compared to OpenPose) and accuracy (mean PCKh@50 and mean PCKh@10)
In this study, OpenPose and EfficientPose were optimized on the general-purpose MPII Human Pose dataset. For many applications (e.g. action recognition and video surveillance), the variability in MPII may be sufficient for directly applying the models to real-world problems. Nonetheless, other particular scenarios deviate from this setting. MPII comprises mostly adults in a variety of everyday indoor and outdoor activities [2]. However, in less natural environments (e.g. movement science laboratories or hospital settings) and with humans of different anatomical proportions, such as children and infants [30], careful consideration must be taken. This could require fine-tuning the MPII models on more specific datasets related to the problem at hand. In our experiments, EfficientPose was more easily trainable than OpenPose (Figure 4 and Table 5). This trait of rapid convergence suggests that it will be convenient to apply transfer learning to EfficientPose models on other HPE data.
EfficientPose may also perform multi-person HPE in a more computationally efficient manner compared to OpenPose. State-of-the-art methods for multi-person HPE are dominated by top-down approaches, which require computation proportional to the number of individuals in the image [12,49]. In crowded scenes, top-down approaches are highly resource demanding. Like the original OpenPose [6], we can take advantage of part affinity fields to group keypoints into persons and thereby perform multi-person HPE in a bottom-up manner. This reduces the computational overhead to a single network inference per image, and hence significantly reduces the computation for multi-person HPE.
The architecture of EfficientPose and the training process can be improved in several ways. First, the optimization procedure (Appendix B) was developed for maximum PCKh@50 accuracy on OpenPose, and simply reapplied to EfficientPose. Other optimization procedures may be more appropriate for EfficientPose, like alternative optimizers (e.g. Adam [20] and RMSProp [43]), and other learning rate and sigma schedules. Second, only the backbone of EfficientPose was pretrained on ImageNet. This could restrict the level of accuracy on HPE, as large-scale pretraining not only supplies robust basic features but also higher-level semantics. Thus, it would be valuable to assess the effect of pretraining on model precision in HPE. We can pretrain the majority of ConvNet layers on ImageNet, and retrain these on HPE data. Third, the proposed compound scaling of EfficientPose assumes that the scaling relationship between resolution, width, and depth is defined by Equation 2 for both HPE and image recognition. However, the optimal compound scaling coefficients may supposedly differ for HPE, where the precision level is more dependent on image resolution than in image classification. Thus, further studies could conduct neural architecture search across different combinations of resolution, width, and depth to determine the optimal combination of scaling coefficients for HPE. Regardless of the scaling coefficients, the scaling of the detection blocks in EfficientPose could be improved. The block depth (i.e. the number of Mobile DenseNets) slightly deviates from the original depth coefficient in EfficientNets due to the rigid nature of the Mobile DenseNets. A carefully designed detection block could address this challenge by providing more flexibility with regards to the number of layers and the receptive field size. Fourth, the computation efficiency of EfficientPose can be further improved by the use of teacher-student network training (i.e. knowledge distillation) [4] to transfer knowledge from a high-scale EfficientPose teacher network to a low-scale EfficientPose student network. This technique has already shown promise in HPE when paired with the stacked hourglass architecture [24,50]. Sparse networks, network pruning, and weight quantization [5,11,46] may also be taken into account to facilitate increasingly accurate and responsive real-life systems for HPE. For high-performance inference and deployment on edge devices, further speed-up can be achieved by the use of specialized libraries such as NVIDIA TensorRT and TensorFlow Lite [10,42].
In summary, EfficientPose tackles single-person HPE with an improved degree of precision compared to the commonly adopted OpenPose network [6]. Despite this, the EfficientPose models elicit a large reduction in the number of parameters and FLOPs. This is achieved by taking advantage of contemporary research within image recognition on computation-efficient ConvNet components, most notably MBConvs and EfficientNets [29,39]. The EfficientPose models are made openly available, serving research initiatives on movement analysis and stimulating further research within high-precision and computation-efficient HPE.

Conclusion
In this work, we have stressed the need for a publicly accessible method for single-person HPE that suits the demands in both precision and efficiency across various applications and computational budgets. To this end, we have presented a novel method called EfficientPose, which is a scalable ConvNet architecture leveraging a computationally efficient multi-scale feature extractor, novel mobile detection blocks, pose association estimates, and bilinear upscaling. To yield model variants that are able to provide flexibility in the compromise between accuracy and efficiency, we have developed the EfficientPose approach to exploit model scalability in three dimensions: resolution, width, and depth. Our experimental results have demonstrated that the proposed approach provides computationally efficient models, allowing real-time inference on edge devices. At the same time, our framework offers the flexibility to be scaled up to deliver more precise keypoint estimates than commonly used counterparts, at an order of magnitude fewer parameters and FLOPs. Taking into account the versatility and high precision level of our proposed framework, EfficientPose provides the foundation for next-generation markerless movement analysis.
In our future work, we plan to develop new techniques to further improve the model effectiveness, especially in terms of precision, by investigating optimal compound model scaling for HPE. Moreover, we will deploy EfficientPose on a range of applications, to validate its applicability, as well as feasibility, in real-world scenarios.

Appendix A: Ablation study
To determine the effect of different design choices in the EfficientPose architecture, we carried out a component analysis.

Training efficiency
We assessed the number of training epochs to determine the appropriate duration of training, avoiding demanding optimization processes. Figure 4 suggests that the largest improvement in model accuracy occurs up to around 200 epochs, after which training saturates. Table 5 supports this observation, with less than a 0.4% increase in PCKh@50 at 400 epochs of training. From this, it was decided to perform the final optimization of the different variants of EfficientPose over 200 epochs. Table 5 also suggests that most of the learning progress occurs during the first 100 epochs. Hence, for the remainder of the ablation study, 100 epochs were used to determine the effect of the different design choices.

Cross-resolution features
The value of combining low-level local information with high-level semantic information through a cross-resolution feature extractor was evaluated by optimizing the model with and without the low-level backbone. Experiments were conducted on two different variants of the EfficientPose model. On coarse prediction (PCKh@50) there is little to no gain in accuracy (Table 6), while for fine estimation (PCKh@10) some improvement (0.6-0.7%) is displayed, taking into account the negligible cost of 1.02-1.06× more parameters and a 1.03-1.06× increase in FLOPs.

Skeleton estimation
The effect of skeleton estimation through the approximation of part affinity fields was assessed by comparing the architecture with and without the single pass of skeleton estimation. Skeleton estimation yields improved accuracy, with a 1.3-2.4% gain in PCKh@50 and 0.2-1.4% in PCKh@10 (Table 7), while only introducing an overhead in size and complexity of 1.3-1.4× and 1.2-1.3×, respectively.

Number of detection passes
We also determined the appropriate comprehensiveness of detection, represented by the number of detection passes.
EfficientPose I and II were both optimized in three different variants (Table 8). The models seem to benefit from intermediate supervision, with a general trend of performance increasing with the number of detection passes. The major benefit is obtained by expanding from one to two passes of keypoint estimation, reflected by a 1.6–1.7% increase in PCKh@50 and 1.8–1.9% in PCKh@10. In comparison, a third detection pass yields only a 0.5–0.8% relative improvement in PCKh@50 over two passes, and no gain in PCKh@10, while increasing size and computation by 1.3× and 1.2×, respectively. From these findings, we concluded that two detection passes offer a beneficial trade-off between accuracy and efficiency.
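Intermediate supervision of this kind is commonly implemented by applying the loss to the output of every detection pass, so earlier passes also receive a direct gradient signal. A minimal sketch (function name, shapes, and values are our own illustration, not taken from the paper):

```python
import numpy as np

def intermediate_supervision_loss(pass_outputs, target):
    """Sum the MSE of every detection pass against the same ground-truth maps."""
    return sum(np.mean((out - target) ** 2) for out in pass_outputs)

target = np.zeros((4, 4))                               # toy ground-truth map
outs = [np.full((4, 4), 0.5), np.full((4, 4), 0.1)]     # pass 1 coarse, pass 2 refined
loss = intermediate_supervision_loss(outs, target)      # 0.25 + 0.01 = 0.26
```

Because every pass contributes a term, later passes learn to refine rather than replace the earlier estimates.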

Upscaling
To assess the impact of upscaling, implemented as bilinear transposed convolutions, we compared the results of the two respective models. Table 9 shows that upscaling improves the precision of keypoint estimates, with large gains of 9.2–12.3% in PCKh@10 and smaller improvements of 0.5–1.1% in coarse detection (PCKh@50). As a consequence of the increased output resolution, upscaling slightly increases the number of FLOPs (1.04–1.1×) with a negligible increase in model size.
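A transposed convolution that performs bilinear upsampling is typically obtained by initializing its kernel with a bilinear interpolation filter. The helper below constructs such a kernel; this is a standard recipe we assume for illustration, not code from the EfficientPose implementation:

```python
import numpy as np

def bilinear_kernel(factor):
    """2D bilinear interpolation kernel of size 2*factor - factor % 2."""
    size = 2 * factor - factor % 2
    center = factor - 0.5 if size % 2 == 0 else (size - 1) / 2
    og = np.arange(size, dtype=float)
    k1d = 1.0 - np.abs(og - center) / factor   # triangular (tent) profile
    return np.outer(k1d, k1d)                  # separable 2D kernel

k = bilinear_kernel(2)  # 4x4 kernel for 2x upsampling
```

Weights near the kernel center dominate, so each upsampled pixel is a distance-weighted blend of its low-resolution neighbors, which explains the negligible parameter cost noted above.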

Appendix B: Optimization procedure
Most state-of-the-art approaches for single-person pose estimation are extensively pretrained on ImageNet [35,36,51], enabling rapid convergence when the models are adapted to other tasks, such as HPE. In contrast, few models, including OpenPose [6] and EfficientPose, utilize only the most basic pretrained features. This facilitates the construction of more efficient network architectures but at the same time requires carefully designed optimization procedures for convergence towards reasonable parameter values.
Training of pose estimation models is complicated by the intricate nature of the output responses. Overall, optimization is performed in a conventional fashion by minimizing the mean squared error (MSE) of the predicted output maps Y with respect to the ground truth values Ŷ across all N output responses.
The predicted output maps should ideally have higher values at the spatial locations corresponding to body part positions, while punishing predictions farther away from the correct location. As a result, the ground truth output maps must be carefully designed to enable proper convergence during training. We achieve this by progressively shrinking the rewarded region around the true location, whose extent is defined by the σ parameter. Higher probabilities T ∈ [0, 1] are assigned to positions P closer to the ground truth position G (Equation 4).
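Equation 4 is not reproduced here, but assuming the common Gaussian heatmap formulation T(P) = exp(−‖P − G‖² / (2σ²)), the target construction can be sketched as follows (the function name and grid layout are our own assumptions):

```python
import numpy as np

def gaussian_heatmap(shape, center, sigma):
    """Target map peaking at 1.0 on the keypoint and decaying with distance."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2  # squared distance to G
    return np.exp(-d2 / (2 * sigma ** 2))

hm = gaussian_heatmap((64, 64), center=(20, 30), sigma=2.0)
```

Shrinking σ over the course of training, as described above, tightens this peak so that later epochs reward only increasingly precise localizations.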
The proposed optimization scheme (Figure 5) incorporates a stepwise σ schedule and utilizes stochastic gradient descent (SGD) with a decaying triangular cyclical learning rate (CLR) policy [33]. The σ parameter is normalized according to the output resolution. As suggested by Smith et al. [34], the large learning rates in CLR provide regularization during network optimization. This makes training more stable and may even increase training efficiency, which is valuable for network architectures, such as OpenPose and EfficientPose, that rely less heavily on pretraining (i.e., have larger portions of randomized weights). In our adoption of CLR, we utilize a cycle length of 3 epochs. The learning rate λ converges towards λ_∞ (Equation 5), where λ_max is the highest learning rate for which the model does not diverge during the first cycle and λ_min = λ_max/3000, whereas σ_0 and σ_∞ are the initial and final σ values, respectively.
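A decaying triangular CLR can be sketched as below. The halving-per-cycle decay follows Smith's "triangular2" variant, which we assume here for illustration; the λ_max value and cycle length in steps are likewise illustrative, not the paper's settings:

```python
def clr(step, steps_per_cycle, lr_max, lr_min):
    """Triangular cyclical learning rate whose amplitude halves each cycle."""
    cycle = step // steps_per_cycle
    pos = (step % steps_per_cycle) / steps_per_cycle  # position within cycle, in [0, 1)
    tri = 1.0 - abs(2.0 * pos - 1.0)                  # rises 0 -> 1, then falls back to 0
    return lr_min + (lr_max - lr_min) * tri / (2 ** cycle)

lr_max = 1e-2
lr_min = lr_max / 3000                        # mirrors the paper's lambda_min definition
peak_cycle0 = clr(50, 100, lr_max, lr_min)    # midpoint of first cycle -> lr_max
peak_cycle1 = clr(150, 100, lr_max, lr_min)   # second cycle peaks at half the amplitude
```

As the cycles proceed, the peaks decay towards λ_min, so the schedule converges while still providing the periodic large-rate regularization described above.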