EfficientPose: Efficient Human Pose Estimation with Neural Architecture Search

Human pose estimation from images and video is a vital task in many multimedia applications. Previous methods achieve great performance but rarely take efficiency into consideration, which makes it difficult to deploy the networks on resource-constrained devices. Nowadays, real-time multimedia applications call for more efficient models for better interaction. Moreover, most deep neural networks for pose estimation directly reuse networks designed for image classification as the backbone, which are not optimized for the pose estimation task. In this paper, we propose an efficient framework for human pose estimation with two parts, an efficient backbone and an efficient head. By implementing a differentiable neural architecture search method, we customize the backbone network design for pose estimation and reduce the computation cost with negligible accuracy degradation. For the efficient head, we slim the transposed convolutions and propose a spatial information correction module to improve the quality of the final prediction. In experiments, we evaluate our networks on the MPII and COCO datasets. Our smallest model costs only 0.65 GFLOPs with 88.1% PCKh@0.5 on MPII, and our large model costs only 2 GFLOPs while its accuracy is competitive with the state-of-the-art large model, i.e., HRNet with 9.5 GFLOPs.


Introduction
Human pose estimation has a wide range of multimedia applications in the real world, e.g., virtual reality, game interaction and assisted living. Human pose estimation targets predicting the locations of human keypoints or parts in images. Traditional methods [42,26] are based on probabilistic graphical models or pictorial structures. Deep convolutional neural networks boosted the development of this field by directly regressing keypoint positions [35] or predicting heatmap locations [23], solving the problem in an end-to-end manner. Later top-down methods [38,30] achieve high accuracies on both the MPII [1] and COCO [20] benchmarks. However, the backbone networks in most previous methods [38,7,19] directly reuse networks designed for image classification, e.g., ResNet [15] and VGG [29], which is sub-optimal for human pose estimation tasks. Moreover, existing human pose estimation methods over-pursue accuracy but ignore the efficiency of the model, making it difficult to deploy on the resource-constrained computing devices common in real-life scenarios. Recent multimedia applications require more efficient human pose estimation to bring a better interaction experience to users, a requirement that current pose estimation algorithms cannot meet. In recent years, the emergence of neural architecture search (NAS) methods has greatly accelerated neural network design. Pioneering works [46,27] incur a huge search cost to obtain architectures with high accuracy on image classification tasks. Later one-shot and differentiable NAS methods [2,22] greatly decrease the search cost while keeping the high performance of the searched architectures. NAS has also been applied to cut down the FLOPs or latency of an architecture while guaranteeing high accuracy [5,37].
Furthermore, some methods aim at directly searching on the target task for customized architecture design and performance promotion, e.g., semantic segmentation [21,44] and object detection [13,10,12]. However, searching for the backbone networks of segmentation, detection and pose estimation models is computationally expensive. In previous pose estimation methods, customized and automatic backbone design is rarely explored. For more optimized networks, searching for the backbone is the more important step, as the backbone takes up a large part of the whole network and plays the role of feature extraction.
To directly search the backbone for pose estimation, we propose an efficient human pose estimation network searching framework, namely EfficientPose. As shown in Fig. 2, our efficient network includes two main parts, an efficient backbone and an efficient head. To tackle the backbone performance bottleneck in terms of both accuracy and efficiency, we use a differentiable NAS method [22] to lighten the computation cost of the backbone network and adapt its architecture to the pose estimation task. Moreover, we design an efficient pose estimation head which enables not only fast inference but also fast architecture search. In the head network, the transposed convolutions are made slimmer according to the width of the backbone, and a spatial information correction (SIC) module is proposed to promote the quality of the feature maps used for generating the heatmaps. Overall, we promote the efficiency of the whole framework for pose estimation. The architectures generated by our framework achieve high performance with low FLOPs and latencies, as shown in Fig. 1.
Our contributions can be summarized as follows: • We propose an efficient pose estimation network searching framework, which is the first method that performs backbone search for pose estimation. The differentiable NAS method adjusts and lightens the backbone network automatically. The head network is redesigned by the proposed slim transposed convolutions and spatial information correction module to get a better trade-off between computation cost and accuracy.
• We obtain an extremely efficient pose estimation network that costs only 0.65 GFLOPs with high performance (88.1% PCKh@0.5). Our large model has only 2 GFLOPs while its accuracy is competitive with the state-of-the-art large model, i.e., HRNet [30] with 9.5 GFLOPs. Besides, the good generalization ability of the proposed EfficientPose networks has been confirmed on the COCO dataset.

Related Work
Human Pose Estimation Previous methods for human pose estimation have achieved great success. DeepPose [35] first applied deep learning to pose estimation, treating it as a regression problem: the convolutional neural network outputs the keypoint positions directly. Most works in recent years instead predict heatmaps, which allows the network to preserve more spatial information. Hourglass [23] proposes to stack repeated U-shaped blocks with skip connections to promote overall performance. PyraNet [41] uses a pyramid residual module to obtain multi-scale information. CPN [7] adopts a GlobalNet to roughly localize keypoints and a RefineNet to explicitly handle hard keypoints. SimpleBaseline [38] constructs a simplified network and applies transposed convolutions to obtain high-resolution heatmaps. HRNet [30] achieves state-of-the-art results on COCO [20] by maintaining high-resolution representations, but its multi-branch framework produces high computation cost and is not friendly to inference on embedded devices. RSN [6] further pushes forward the results of this task, using an attention mechanism and aggregating features within the same level for precise keypoint localization. Though previous methods achieve high accuracies in human pose estimation, most of them bear large computation cost and high latency. The efficiency of the network is rarely considered, which makes these methods difficult to deploy in real-world scenarios. Bulat et al. [4] promote the performance of binarized models to get a better trade-off between model size and accuracy. DU-Net [34] applies quantization to reduce the model size. FPD [43] uses a strong teacher network to supervise small student networks for performance promotion. We aim at constructing a lightweight framework for pose estimation which achieves high accuracy at a low computation cost.
Neural Architecture Search Recent NAS methods greatly promote the performance of neural networks. Early NAS works use reinforcement learning [46,32] or evolutionary algorithms [27,9] to search for architectures and achieve results superior to manually designed networks. Later one-shot [2,3] or differentiable [22,5,37,8,39] NAS methods search for high-performance architectures at a low computation cost. To further cut the search cost, some methods search for a cell structure and then stack it to build the final architecture [22,39]. NAS has also been applied to promote model efficiency [31,32,5,37] by optimizing FLOPs, latency, etc. Meanwhile, NAS methods have been implemented on different tasks, e.g., semantic segmentation [21,44] and object detection [13,10,12]. In this paper, we implement differentiable NAS on the backbone network design targeted at human pose estimation. Our NAS-adjusted networks achieve a better trade-off between accuracy and efficiency than other state-of-the-art methods [38,30].

Figure 2. The framework of EfficientPose. For the backbone part, we use the NAS method to obtain more efficient and precise networks. The desired network is finally derived from the super network. For the head part, we slim the transposed convolution and use a SIC module to correct the spatial information in the feature maps after the transposed convolutions. The input "Conv3 × 3" denotes a plain 3 × 3 convolution followed by a 3 × 3 separable depthwise convolution [16].

Method
In this section, we first introduce the differentiable neural architecture search method, which is used to automatically design the backbone network for pose estimation. Second, we explain how the search space for the backbone network is designed and how we optimize both accuracy and latency. Finally, we illustrate the design principles of our redesigned lightweight head. The whole framework is shown in Fig. 2.

Differentiable Neural Architecture Search
We use differentiable NAS [22,5,37] to customize the backbone network design on the pose estimation task. We formulate NAS as a nested optimization problem over the search space $S$: we search for the architecture $a \in S$ that minimizes the loss $\mathcal{L}(a, w_a)$, where $w_a$ denotes the operation weights of architecture $a$.

In the differentiable NAS method, the search space $S$ is relaxed into a continuous representation. Specifically, in every layer $\ell$ of the architecture, we compute the probability of each candidate operation as
$$p_o^{\ell} = \frac{\exp(\alpha_o^{\ell})}{\sum_{o' \in O} \exp(\alpha_{o'}^{\ell})},$$
where $O$ denotes the set of candidate operations and $\alpha_o^{\ell}$ denotes the architecture parameter of operation $o$ in layer $\ell$. The output of each layer is computed as a weighted sum of the output tensors of the candidate operations,
$$x^{\ell+1} = \sum_{o \in O} p_o^{\ell} \cdot o(x^{\ell}),$$
where $x^{\ell}$ denotes the input tensor of layer $\ell$. During the search process, the operation weights and architecture parameters are optimized alternately by stochastic gradient descent to minimize the loss $\mathcal{L}$. The final architecture is derived from the distribution of the architecture parameters.
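The continuous relaxation above can be sketched in a few lines of NumPy. The candidate operations here are toy stand-ins (the real candidates are MBConv blocks), so this illustrates only the softmax-weighting scheme, not the paper's implementation:

```python
import numpy as np

def mixed_op_output(x, ops, alphas):
    """One searchable layer under the continuous relaxation: the output is
    the softmax(alpha)-weighted sum of all candidate operation outputs."""
    a = np.asarray(alphas, dtype=np.float64)
    probs = np.exp(a - a.max())  # numerically stable softmax
    probs /= probs.sum()
    out = sum(p * op(x) for p, op in zip(probs, ops))
    return out, probs

# Toy candidates standing in for, e.g., MBConv blocks of different sizes.
ops = [lambda x: x, lambda x: 2.0 * x, lambda x: -x]
y, probs = mixed_op_output(np.ones(4), ops, alphas=[0.0, 0.0, 0.0])
# With equal alphas each op gets weight 1/3, so y = (1 + 2 - 1)/3 * x.
```

During search, gradients flow to `alphas` through `probs`; the derived architecture keeps, per layer, the operation with the highest probability.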

Efficient Backbone
Most previous methods [36,7,38,19] use image classification networks as the backbone for pose estimation, e.g., ResNet [15], MobileNetV2 [28]. To narrow the gap between classification and pose estimation tasks, we redesign the backbone network with the NAS method. We first give the details about the search space. Then we describe the method of optimizing both accuracy and latency of the model.

Search Space
We construct our search space based on the MobileNetV2 [28] network, which holds great performance at a low computation cost. Different from previous methods [38,30], we only perform 4 down-sampling operations in the backbone network (5 in most other methods). We consider that in a lightweight pose estimation network, 5× down-sampled feature maps contribute little to the final prediction and may lose much information. The higher-resolution (4×) feature maps are used for up-sampling. We give the details of the backbone network in Tab. 1.

Table 1. Search space of the backbone. The first "Conv3×3" denotes the plain convolution with a 3×3 kernel size. "SepDepth3×3" denotes a separable depthwise convolution [16] with kernel size 3×3. "TransConv" denotes the transposed convolution. "s2" denotes a stride of 2. EfficientPose-A is searched under the small setting of the search space, while EfficientPose-B and -C are searched under the large setting.
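As a quick sanity check of the resolution argument, the arithmetic of repeated stride-2 down-sampling can be written out. The 256-pixel input side follows the MPII setting used later; the helper itself is illustrative:

```python
def feature_resolution(input_size, num_downsamples):
    """Spatial side length after repeated stride-2 down-sampling."""
    size = input_size
    for _ in range(num_downsamples):
        size //= 2
    return size

# With a 256x256 crop, 4 down-samplings keep 16x16 feature maps for the
# up-sampling head, whereas the usual 5 would shrink them to 8x8.
res4 = feature_resolution(256, 4)  # 16
res5 = feature_resolution(256, 5)  # 8
```

The 4× setting thus hands the head feature maps with four times as many spatial positions, which is why the extra down-sampling stage is dropped.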

Cost Optimization
The cost of a neural network is of critical importance and can be measured with different benchmarks, e.g., FLOPs or latency. To obtain a high-performance network with low cost, we integrate cost optimization into the search formulation. As most differentiable NAS methods [5,37,11] do, we build a lookup table to predict the cost of the architecture during search. The cost of one layer is computed as
$$\mathrm{cost}^{\ell} = \sum_{o \in O} p_o^{\ell} \cdot \mathrm{cost}_o^{\ell},$$
where $\mathrm{cost}_o^{\ell}$ is the cost of operation $o$ in layer $\ell$ and $p_o^{\ell}$ denotes the corresponding probability computed from the architecture parameters. The cost of the whole network is the sum over all layers,
$$\mathrm{cost} = \sum_{\ell} \mathrm{cost}^{\ell}.$$
We add the cost regularization to the loss function for multi-objective optimization. The loss function during the search is defined as
$$\mathcal{L} = \frac{1}{K}\sum_{k=1}^{K} \left\lVert \hat{m}_k - m_k \right\rVert_2^2 + \lambda \left( \log \mathrm{cost} \right)^{\tau},$$
where $\hat{m}_k$ is the predicted heatmap of the $k$-th joint, $m_k$ is the ground truth, and $\lambda$ and $\tau$ are hyperparameters balancing the MSE loss and the cost regularization.
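The lookup-table cost estimate can be sketched as follows. The table values and layer count are made-up toy numbers, and `expected_cost` only mirrors the weighted-sum formula above, not the actual search code:

```python
import numpy as np

def softmax(a):
    e = np.exp(np.asarray(a, dtype=np.float64) - np.max(a))
    return e / e.sum()

def expected_cost(alphas_per_layer, lut):
    """Differentiable cost estimate: per layer, the architecture-parameter
    softmax weights the measured cost of each candidate op; the network
    cost is the sum over layers."""
    total = 0.0
    for layer, alphas in enumerate(alphas_per_layer):
        probs = softmax(alphas)
        total += float(probs @ np.asarray(lut[layer]))
    return total

# Hypothetical lookup table: measured latencies (ms) of 3 candidate ops
# in each of 2 layers.
lut = [[0.2, 0.4, 0.8],
       [0.1, 0.3, 0.5]]
# Layer 1 is undecided (equal alphas); layer 2 strongly prefers op 0.
alphas = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
cost = expected_cost(alphas, lut)  # ~0.467 + ~0.100
```

Because `expected_cost` is differentiable in the alphas, the regularizer steers the search toward cheap operations while the MSE term keeps accuracy up.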

Efficient Head
The head of the pose estimation network aims to generate high-resolution heatmaps. To obtain a more efficient network, we redesign the head. We first propose slim transposed convolutions, which produce high-quality heatmaps with greatly decreased computation cost. Then we propose a spatial information correction (SIC) module, which makes the spatial information of the high-resolution representation more reliable. The SIC module promotes the prediction performance at a negligible computation cost.

Slim Transposed Convolutions Different from previous methods [38,30], the backbone of our network outputs feature maps with a small number of channels. Accordingly, we cut down the width of the transposed convolutions. Later experiments show the effectiveness of our slim transposed convolutions. Moreover, we explore more efficient upsampling convolutions in experiments, e.g., using the separable depthwise convolution [16] to perform upsampling, which achieves great performance as well.
Spatial Information Correction As shown in Fig. 6, after the transposed convolutions, the feature maps present a checkerboard pattern of artifacts [24], which is caused by the uneven overlap of the transposed convolutions. To promote the quality of the feature maps for heatmap generation, we add a spatial information correction (SIC) module, a 3 × 3 depthwise convolution in our implementation, after the transposed convolutions. With the SIC module, the checkerboard artifact pattern is almost eliminated, which remarkably promotes the prediction performance at negligible computation cost. We perform more studies of the module in experiments.
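To illustrate how a 3 × 3 depthwise convolution can suppress a checkerboard pattern, here is a minimal NumPy sketch. The SIC module in the paper learns its kernels; this demo uses a fixed averaging kernel purely to show the smoothing effect:

```python
import numpy as np

def depthwise_conv3x3(feat, kernels):
    """3x3 depthwise convolution with 'same' edge padding: each channel is
    filtered independently by its own kernel (SIC is such a depthwise
    convolution, but with learned kernels)."""
    c, h, w = feat.shape
    padded = np.pad(feat, ((0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.empty((c, h, w), dtype=np.float64)
    for ch in range(c):
        for i in range(h):
            for j in range(w):
                out[ch, i, j] = np.sum(padded[ch, i:i + 3, j:j + 3] * kernels[ch])
    return out

# A checkerboard pattern, mimicking transposed-convolution artifacts [24],
# is strongly flattened even by a fixed 1/9 averaging kernel.
checker = ((np.arange(8)[:, None] + np.arange(8)[None, :]) % 2).astype(float)[None]
smoothed = depthwise_conv3x3(checker, [np.full((3, 3), 1.0 / 9.0)])
```

The standard deviation of the map drops by an order of magnitude after filtering, which is the qualitative effect Fig. 6 shows on real feature maps.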

Experiments
In this section, we first perform experiments on the MPII [1] dataset. We describe the implementation details and compare the EfficientPose networks with other state-of-the-art (SOTA) methods and with different backbone networks. Then the generalization ability of the EfficientPose networks is demonstrated on the COCO [20] dataset. We further specialize the EfficientPose network design on different model cost benchmarks. Finally, ablation studies show the effectiveness and efficiency of the different modules in our framework.

Experiments on MPII
Implementation Details The MPII [1] dataset contains approximately 25K images with about 40K people. Following the standard training settings [41,38,30], all input images are cropped to 256×256 for fair comparisons. For the backbone architecture search, we randomly split the training data into two parts, 80% for operation weight training and 20% as the validation set to update architecture parameters. The original validation set is never used in the search process. We adopt the same data augmentation strategy as SimpleBaseline [38].
Before the search process, we build a lookup table for the latency of each operation. The latency is measured on a single GeForce RTX 2080Ti GPU with a batch size of 32. For the backbone architecture search, we first train the operation weights for 60 epochs, which takes 6.5 hours on 2 2080Ti GPUs. Then we start the joint optimization by alternately training the operation weights and the architecture parameters in each epoch. For operation weight updating, we use the SGD optimizer with 0.9 momentum and 1e-4 weight decay. The initial learning rate is set to 0.05 and is gradually decayed to 0 following the cosine annealing schedule. For architecture parameter optimization, we use the Adam optimizer with a fixed learning rate of 3e-4. The batch size is set to 32. The joint optimization process takes 150 epochs, 19 hours on 2 2080Ti GPUs. The whole search process takes only 25.5 hours on 2 GPUs, 51 GPU hours in total.
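The cosine annealing schedule mentioned above has a simple closed form; the sketch below assumes it is applied from the stated initial rate of 0.05 down to 0, with `step`/`total_steps` as illustrative granularity:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.05):
    """Cosine annealing from base_lr down to 0 over total_steps, matching
    the schedule used for the operation-weight optimizer."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

# Start of training: full rate; halfway: half rate; end: ~0.
lr_start = cosine_lr(0, 150)
lr_mid = cosine_lr(75, 150)
lr_end = cosine_lr(150, 150)
```

Compared with step decay, the cosine form lowers the rate smoothly, which is a common choice for the relatively short weight-training phase of a search.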
We train the derived network for 200 epochs using the Adam optimizer with an initial learning rate of 1e-3 and a batch size of 32. The learning rate is decayed by 0.1 at the 150th and 170th epochs, respectively. The fast neural network adaptation method (FNA) [10,12] is used to initialize both the super network in the search process and the derived network for efficient training.

Comparisons with SOTA Pose Estimation Networks
The results on the MPII validation set are shown in Tab. 2. We search for three networks at different scales and compare them with the SOTA methods. Our EfficientPose networks achieve competitive and even higher performance at far lower computation cost and GPU latency (55.47 ms). The GPU latency comparison is visualized in Fig. 3, and we show the searched architectures in Fig. 4.

Figure 4. The architectures of EfficientPose-A and -B. We denote the MBConv blocks with diverse kernel sizes and expansion ratios as colored rectangle boxes. "Kx Ey" denotes the block with kernel size "x" and expansion ratio "y". Blocks in the same stage (with the same width) are contained in the dashed box.

Comparisons with Different Backbone Networks To further demonstrate the performance of our searched lightweight backbones, we compare our networks with others with only the backbone networks changed, including manually engineered networks [15,28] and NAS networks [5]. We train all the compared networks under the same training settings and hyperparameters. We only perform 4 down-sampling operations in the compared models for a fair comparison.
The comparison results are shown in Tab. 3. In the first group, the models are all constructed with MBConv blocks [28]. EfficientPose-A achieves the highest PCKh@0.5 with the lowest FLOPs. Furthermore, compared with the large model ResNet-50 [15], EfficientPose-A achieves similar performance with 14.9× fewer FLOPs.

Generalization Ability on COCO
We transfer the models searched on MPII to COCO [20]. Only the output dimension of the final 1×1 convolution is adjusted, as the number of keypoints changes. COCO contains about 200K images with about 250K person instances. We adopt the same data augmentation strategy as HRNet [30]. The input size is set to 192 × 256. The whole training process takes 240 epochs using the Adam optimizer with a batch size of 128. We use a warm-up strategy in the first 500 iterations to linearly increase the learning rate to 1e-3. Then the learning rate is decayed by 0.1 at 200K and 240K iterations, respectively.
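The warm-up plus step-decay schedule described above can be sketched as a small function. The thresholds and rates are as stated; the function name and exact per-iteration warm-up form are illustrative assumptions:

```python
def coco_lr(iteration, base_lr=1e-3, warmup_iters=500):
    """Warm-up then step-decay schedule: linear warm-up to base_lr over the
    first 500 iterations, then decay by 0.1 at 200K and again at 240K."""
    if iteration < warmup_iters:
        return base_lr * (iteration + 1) / warmup_iters
    lr = base_lr
    if iteration >= 200_000:
        lr *= 0.1
    if iteration >= 240_000:
        lr *= 0.1
    return lr
```

Warm-up avoids unstable early updates at the large batch size of 128; the late step decays then fine-tune the heatmap regression.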
As shown in Tab. 4, our EfficientPose networks achieve high AP with the lowest FLOPs on COCO. With nearly 500 MFLOPs, the EfficientPose-A network achieves results comparable to Hourglass [23], CPN [7] and LPN [45]. EfficientPose-B surpasses both LPN [45] networks and SimpleBaseline-R50 with only 1.1 GFLOPs. EfficientPose-C also obtains a high AP with far fewer FLOPs. The effectiveness of the EfficientPose networks is demonstrated on the COCO test set in Tab. 5 as well.

Specialization on Different Hardwares
We specialize our EfficientPose network design on three different model cost benchmarks, i.e., FLOPs, GPU latency and CPU latency. We achieve this by re-measuring the cost term in the loss function (Eq. 6) of the backbone architecture search. The GPU latency is measured on one Tesla V100 GPU with a batch size of 32 and the CPU latency is measured on an Intel(R) Core(TM) i7-8700K CPU with a batch size of 1. For each cost benchmark, we search for two networks and visualize the results in Fig. 5. This experiment shows that our pose estimation framework can be efficient on diverse hardware platforms. The specialization process can be easily performed by building different lookup tables on the cost benchmarks to predict the model cost. Though FLOPs is widely used for evaluating model cost, the results in Fig. 5 and some recent works [31,5] demonstrate that FLOPs does not correlate well with real inference speed due to the discrepancies between hardware platforms. The ability to specialize models for the target device is therefore vital for real applications.

Ablation Studies
Effectiveness of SIC Module We visualize the feature maps in the network to study the effect of the spatial information correction (SIC) module in Fig. 6. We find that the checkerboard pattern of artifacts [24] exists both in the feature maps after the transposed convolutions of the network trained without SIC and in the feature maps before SIC in the EfficientPose networks. The SIC module eliminates the checkerboard pattern in the feature maps produced by the transposed convolutions and makes the region of interest more concentrated, which contributes to the final prediction. We further perform ablation studies of SIC in Tab. 6. The SIC module shows evident accuracy promotion in both the manually designed network MobileNetV2 and the EfficientPose networks.

Efficiency of Slim Transposed Convolutions We set the widths of the two transposed convolutions in the head as 64 and 32 respectively, which are set much larger in previous works, e.g., 256 in SimpleBaseline [38]. To further demonstrate the efficiency of our slim transposed convolutions, we compare with larger width settings of the two transposed convolutions in the head. As shown in Tab. 7, when we enlarge the widths to 64 and 64, the FLOPs in the head increase by 134M but no accuracy promotion is obtained. When we set the widths even larger, 96 and 96, the FLOPs increase by 520M and a worse PCKh@0.5 is obtained. It is worth noting that the FLOPs of the (96, 96) head are much larger than those of the backbone. We attribute the performance degradation to overfitting when the head is too heavy relative to a lightweight backbone.

Studies on Efficient Transposed Convolutions We study lighter convolution modules for the transposed convolutions, the separable depthwise convolution in MobileNetV1 [16] (MBV1) and the inverted residual block in MobileNetV2 [28] (MBV2). As shown in Tab. 8, we find that the MBV1-style module decreases the FLOPs by 181M with acceptable accuracy decay, which can be an option for the efficient head as well.
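The 134M figure is consistent with a plain multiply-accumulate count for the second transposed convolution, assuming (as is common for 2× upsampling) 4 × 4 kernels, stride 2, and 64 × 64 output heatmaps from 16 × 16 backbone features. The helper below is an illustrative sketch under those assumptions, not the paper's exact FLOPs counter:

```python
def transconv_macs(c_in, c_out, h_out, w_out, k=4):
    """Multiply-accumulate count of a k x k transposed convolution,
    tallied at its output resolution (FLOPs here means MACs)."""
    return k * k * c_in * c_out * h_out * w_out

# Two upsampling stages from 16x16 backbone features: 16->32, then 32->64.
# Second stage with the slim widths (64, 32) vs. the widened (64, 64):
slim = transconv_macs(64, 32, 64, 64)   # 134,217,728 MACs
wide = transconv_macs(64, 64, 64, 64)   # 268,435,456 MACs
extra = wide - slim                     # ~134M extra MACs
```

The ~134M delta from doubling the second width alone matches the reported head-FLOPs increase, which makes the overfitting explanation for the wider heads plausible: the extra cost buys capacity, not accuracy.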
Comparisons with Random Search We perform a random search experiment, which is a vital baseline for evaluating the effectiveness of a NAS method [18]. We randomly sample 50 architectures and train each one for 5 epochs to evaluate its validation performance. Finally, we select the best-performing one and train it with the same settings as our EfficientPose. The total cost of random search is the same as ours. The results are shown in Tab. 3. Our EfficientPose-A shows an evident advantage over the randomly searched architecture.

Conclusion
In this paper, we propose a framework targeted at efficient human pose estimation. We use the differentiable NAS method to automatically customize the backbone network for pose estimation and greatly reduce the computation cost. We further design an efficient head network, which includes both slim transposed convolutions and a spatial information correction module, to promote the prediction performance with negligible FLOPs/latency increase. The proposed EfficientPose networks achieve accuracies similar to other SOTA methods at far less computation cost.

[5] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.