1 Introduction

Human pose estimation (HPE) is one of the most fundamental topics in the computer vision (CV) research community, since many higher-level applications benefit from it. Among the first methods applied to this problem were the part-based model [1] and the flexible mixture of parts [2]. Both methods use template matching or local detectors to localize body parts in RGB images and then correlate them in order to find the human pose joints. Unfortunately, these methods require hand-crafted features such as color, edges, or descriptors from signal transformations, leading to computationally intense solutions with low performance in real environment conditions. The adoption of convolutional neural networks (CNNs) in HPE in 2014 [3] gave a significant boost to solution performance due to their large generalization capability and automated feature extraction. The most recent advance around this topic is 3D HPE, where one of the first and most mature methods appeared in 2017 [4]. HPE methods can be categorized into single-person and multiple-person pose estimation. Figure 1 shows a visual representation of these two approaches along with their sub-categories.

Fig. 1 HPE methods variations

Single-person pose estimation solves the problem of predicting human body joints when the input image contains only a single person. In multiple-human scenarios, an additional method is required to crop the image into sub-images of individual persons. Multiple-person pose estimation localizes body joints in an image with multiple persons. In this category, top-down approaches are similar to single-person ones because they first detect all persons in an image and then predict the pose of each detected human. The bottom-up approach is less constrained: in a given image with multiple persons, human body keypoints are estimated first and then grouped to the person they belong to. Multiple-person techniques, and especially top-down ones, are preferred for UAV deployment because the operating environment is usually crowded. From the data representation perspective, there are three main data structures for the human body pose. The kinematic model [5] is the most mainstream structure, where body joints are represented as points. An advantage of this model is its low computational cost; a disadvantage is that it does not contain any texture or shape features about the human body. The planar model, as used in [6], is a scheme where each body part is represented as a rectangle in order to form the human body. An advantage of this representation is the addition of shape information about the human body contour; a disadvantage is that in high-occlusion situations its performance drops significantly. Finally, the volumetric model, as presented in [7], is a human body representation built from 3D data, where geometric shapes and body meshes are combined to produce the richest data structure of the human body. Under this model, each point of the human body can have 1–3 degrees of freedom (DOF). Its advantage is the rich information about the human body pose; its disadvantage is the high computational cost. To sum up, the diversity of HPE techniques and representations is covered at an abstract level. The aims of this work are: (1) to highlight the evolution of HPE techniques (with a focus on CNN-based ones), (2) to present HPE challenges from the perspectives of problem-solving performance and algorithmic computational footprint, and (3) to present benchmark results of 36 2D CNN-based models on 3 well-known datasets: MS-COCO [8], OCHuman [9], and UAV-Human [10]. The benchmark is limited to 2D data because of the scarcity of 3D data under UAV conditions with complex scales and occlusions. The contributions of this work are:

  1. An exhaustive benchmark of HPE algorithms under the scope of performance and efficiency in UAV operation.

  2. For the first time, a computational projection onto numerous edge device specifications is conducted for HPE algorithms.

This study is primarily motivated by the necessity to develop UAV-based systems for cinematography, with the objective of utilizing HPE computer vision algorithms as part of a research project.

The structure of this paper is as follows: Sect. 2 presents related work on HPE techniques based on the classification made above, along with their advantages and disadvantages in terms of performance and efficiency. In the same section, a literature analysis is presented to highlight the evolution of HPE techniques over the past 22 years and to correlate them with known challenges. Section 3 presents the benchmark properties, algorithms, metrics, and datasets along with the conducted process scheme. Section 4 analyzes the benchmark results under the above-mentioned aspects in order to highlight the algorithms most balanced between these two concepts. Section 5 presents a projection of the benchmark efficiency results onto edge computing hardware specifications, in order to highlight the most suitable devices for this situation. Finally, in Sect. 6, conclusions are presented and a discussion is made around HPE algorithms.

2 Related work

In this section, related work is presented based on the variations of HPE methods, with emphasis on CNN-based approaches. Common challenges faced by all HPE methods are human occlusions and scale and pose variations, especially in crowded images. More specifically for UAVs, fast movements cause motion blur in the camera image and noise from rapid lighting changes. In addition, a large distance from the target makes the human scale smaller and occlusions a more difficult challenge.

2.1 Single-person HPE

Starting with single-person pose estimation, better-performing solutions exist here than in multiple-person pose estimation because no extra method is required to isolate individual persons. In comparison, multiple-person methods are more computationally intensive because few all-in-one models produce human bounding boxes and poses at the same time; hence, the cooperation of two different models is required, which raises hardware resource consumption. In addition, under this approach the pose estimation model is highly constrained by the object detection performance. Single-person approaches typically solve a regression problem. In direct regression, joint coordinates are regressed directly from a single feature map. This technique is the simplest and most efficient approach, but it has drawbacks such as low performance in modeling the relationships between human body joints. To overcome this challenge, the authors of [11] proposed a method that calculates prediction errors in order to refine the predicted poses. Another challenge in this approach is the pooling operation of CNN models, which destroys the relationships and locality of joints. The heatmap-based technique instead produces a heatmap for each human body joint in order to predict joint locations [12, 13]. Comparing the two techniques, direct regression has difficulty mapping image content to a human pose and often requires much higher image resolutions. On the other hand, heatmap-based approaches require heuristic heatmap manipulation in order to form the final pose, which makes them more complex. From the efficiency perspective, direct regression approaches are easier to deploy because they have a single structure and can easily be optimized with known inference engines, in contrast to heatmap-based models, which require custom plugins to achieve inference optimization. From the UAV perspective, direct regression can achieve better efficiency but worse performance considering the challenges that might occur, whereas heatmap-based methods might be preferable since they can better handle challenges such as occlusions and scales.
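To make the distinction concrete, the following minimal sketch (NumPy assumed, helper name hypothetical) shows the decoding step that heatmap-based methods require and direct regression avoids: each joint location is recovered as the peak of its heatmap.

```python
import numpy as np

def decode_heatmaps(heatmaps, stride=4):
    """Recover one (x, y, confidence) keypoint per joint heatmap.

    heatmaps: array of shape (num_joints, H, W), one heatmap per joint.
    stride:   downscale factor between input image and heatmap resolution.
    """
    keypoints = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)  # peak = joint location
        keypoints.append((int(x) * stride, int(y) * stride, float(hm[y, x])))
    return keypoints

# Toy usage: 17 joints (COCO convention) on a 64x48 heatmap grid
heatmaps = np.random.rand(17, 64, 48)
print(decode_heatmaps(heatmaps)[:3])
```

Real systems refine the peak with sub-pixel offsets, but this argmax step is the heuristic manipulation referred to above.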

2.2 Multiple-person HPE

Multiple-person pose estimation is more challenging since its computational cost scales with the number of persons. Under this category there exist two common approaches: top-down and bottom-up methods. Top-down methods are similar to single-person ones and thus share the same disadvantages regarding efficiency and deployment challenges. A proposed implementation that overcomes this challenge is Mask-RCNN [14], which can predict human bounding boxes and human joints from the same backbone feature map. Another interesting method is HRNet [15], where down-sampling and up-sampling sub-networks generate low- and high-level features from input images, leading to better human pose estimations. The bottom-up approach discards the object detection constraint because it detects all human joints in an image and then groups them per person. One of the breakthroughs in this direction is OpenPose [16], which uses a nonparametric representation of human body joints called Part Affinity Fields (PAFs). Under this approach, the feature encoding includes both joint position and orientation, creating a confidence map that proved more robust in the grouping procedure. Another interesting proposition came in [17], where a method called associative embedding enables supervised convolutional networks to perform detection and grouping. Under this approach, two kinds of heatmaps are produced simultaneously, one for human joint detection and one for joint grouping. Both of these methods have advantages and disadvantages. Bottom-up methods can distinguish human poses even in high-occlusion situations, while top-down methods perform worse there. On the other hand, bottom-up methods often use very large network structures in order to achieve highly robust features and estimations, which makes them computationally costly, especially for edge device deployment. Finally, top-down methods are the most suitable for UAV deployment because they emphasize person detection as well; thus, they can handle the challenges better than bottom-up methods, which might have difficulties with small scales and high occlusions.
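To illustrate the grouping step that bottom-up methods must solve, the sketch below is a deliberately simplified, greedy variant of tag-based grouping in the spirit of associative embedding, not the exact procedure of [17]:

```python
import numpy as np

def group_by_tags(detections, tag_threshold=1.0):
    """Greedily assign joint detections to persons by embedding-tag distance.

    detections: list over joint types; each entry is a list of
                (x, y, score, tag) tuples detected for that joint type.
    Returns a list of persons, each a dict {joint_type: (x, y, score)}.
    """
    persons, person_tags = [], []
    for joint_type, candidates in enumerate(detections):
        for x, y, score, tag in candidates:
            # Find the existing person whose mean tag is closest
            best, best_dist = None, float("inf")
            for i, tags in enumerate(person_tags):
                d = abs(tag - np.mean(tags))
                if d < best_dist:
                    best, best_dist = i, d
            if best is not None and best_dist < tag_threshold:
                persons[best][joint_type] = (x, y, score)
                person_tags[best].append(tag)
            else:  # tag too far from every group: start a new person
                persons.append({joint_type: (x, y, score)})
                person_tags.append([tag])
    return persons
```

The published method learns the tags jointly with the detection heatmaps and uses a more careful matching; the sketch only conveys why grouping adds cost and failure modes that top-down methods avoid.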

2.3 Literature analysis

In many scientific papers, books, and journals, the HPE problem is mentioned as one of the hottest topics in the CV community. In this section, a presentation and analysis are made based on data retrieved from the Publish or Perish software [18] with Google Scholar as the source. The aims of this section are to highlight the significance of HPE and the proposed works, and to relate these works to the aspects of the proposed paper. In addition, an analysis is made from the perspective of 3D HPE as a new approach to this problem. The searched keywords were 2D HPE AND 3D HPE. In Fig. 2, the published papers for the examined time range are presented for 2D and 3D accordingly.

Fig. 2 2D (red) and 3D (blue) HPE published papers by year

Based on Fig. 2, it is clear that HPE gained an uptrend very close to the appearance of CNNs for this problem. In particular, in recent years 3D methods have drawn growing attention, and the contribution around this topic is higher. For a deeper analysis, Fig. 3 presents the citations of the published scientific papers for each year and for each keyword.

Fig. 3 2D (red) and 3D (blue) HPE published paper citations by year

As mentioned above, 2014 was the year in which a CNN-based method was first proposed for 2D HPE, and according to Fig. 3, the citations are very high relative to the number of papers published that year. This means that in 2014 the groundwork was laid for the maturation of CNN-based methods, which is justified by the citations produced in the following years around 2D HPE. From 2020 onward, citations decrease because the published papers are very recent. Similar to 2D HPE, the 3D approach had a significant impact in 2017, which is approximately the year when 3D HPE began to be solved by CNN models together with 3D sensors such as the Kinect. Comparing the two figures, it can be noted that 2D HPE also experienced a significant impact from the appearance of 3D methods. Based on Fig. 3, it is clear that 3D is a very new approach with some limitations, as will be presented next, while 2D HPE, in contrast, has started to become entrenched. With appropriate Python code, the collected papers are analyzed by searching for the keywords UAVs, drones, lightweight, and edge devices in their titles and abstracts. In Fig. 4, the published papers that include the above keywords for 2D and 3D HPE are presented accordingly.
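A minimal sketch of this filtering step is given below, assuming the collected metadata were exported to a CSV file with `Title` and `Abstract` columns (hypothetical file and column names, not the exact script used for Fig. 4):

```python
import csv

KEYWORDS = ("uav", "drone", "lightweight", "edge device")

def matches(record):
    """True if any keyword appears in the paper's title or abstract."""
    text = (record.get("Title", "") + " " + record.get("Abstract", "")).lower()
    return any(kw in text for kw in KEYWORDS)

with open("hpe_papers.csv", newline="", encoding="utf-8") as f:
    hits = [row for row in csv.DictReader(f) if matches(row)]

print(f"{len(hits)} papers mention UAV/edge-efficiency keywords")
```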

Fig. 4 2D (red) and 3D (blue) HPE published papers with efficiency aspects

UAVs are a very new and challenging platform for CV applications around humans, such as surveillance, data mining, action recognition, threat detection, and object tracking. In all these applications, pose estimation is a vital step in the extraction of robust high-level information. An efficiency challenge in this kind of system setup is that the CV algorithms need to be deployed on edge devices with limited hardware resources in order to be power-friendly, since UAVs usually run on batteries. In addition, the input images have more complex content, with scale variations, motion blur, occlusions, and lighting changes. HPE methods developed under these conditions need to balance performance and efficiency in order to be appropriate for real environment situations under the scope of UAV deployment and application. Based on Fig. 4, only a few published papers deal with these challenges, and it appears to be a very complex problem: there are only 28 papers for 2D HPE and 10 for 3D HPE. Several reasons can justify this scarcity: high 3D sensor noise in outdoor environments, and the computational intensity of 3D HPE approaches, which extract more information than 2D ones or include a 3D reconstruction process in order to find the human body joints.

2.4 2D HPE literature analysis

In [19], the authors proposed a pictorial method for multiple-person HPE detection and classification. Their performance measurements show that for 15 persons the algorithm required 1.5 min of processing and the overall accuracy was 71%. In [20], a lightweight pose estimation model (LPE) is proposed, which in its smallest form has 1 GFLOP of operations and achieved 17 frames per second (FPS) with 67% average precision (AP). To achieve this, they exploited an attention mechanism that captures local pixel-level relationships of human joints in the image context. In [21], another CNN-based approach is proposed, which achieved 85% AP on the COCO benchmark dataset [8] with only 4 million parameters. To achieve this, the authors incorporated geometrical and structural information from the input data. Unfortunately, there is no reference to speed, and the test environment is not a UAV one. Another interesting method, proposed in [22], used two pyramid attention networks to achieve multi-scale feature fusion for better feature representation in HPE, with 1.5 million network parameters, achieving up to 80% AP on known datasets. Like the previous work, the test environments do not involve UAVs or edge devices. A very representative method proposed in [23] exploits a network architecture optimization that includes an encoder–decoder backbone with deconvolution layers. The authors note that optimizing the deconvolution layers decreases the computational cost while keeping accuracy stable. The achieved results were up to 90% accuracy and 60 FPS on a Jetson AGX board. Similar to LPE, in [24] the authors used an attention mechanism and the ghost module found in known models such as MobileNet and ShuffleNet. The results show that their model achieved up to 90% with only 1.7 million parameters. Continuing with attention-mechanism approaches, in [25] the authors used HRNet [15] as a backbone and a transformer module to apply feature encoding before a regression head. With this addition, the authors achieved an 86% reduction in network parameters, but without any reference to UAVs or edge device deployment. At a more general level, the 2D HPE approaches proposed from the perspective of efficiency and performance exploit attention mechanisms that enrich the feature representation and decrease the number of network parameters [26,27,28,29,30,31,32]. Unfortunately, none of these papers includes a UAV application environment, which might change the performance metrics; their main scope of interest is only lightweight solutions, which fulfill the efficiency aspect and are measured only on desktop computer machines. Of the above-presented papers, only 1 predates 2019, while the rest come after, which is consistent with Fig. 4.

2.5 3D HPE literature analysis

For the 3D approach to HPE, an interesting proposition was made in [33]. The authors proposed a lightweight model with a MobileNet backbone in order to predict 3D human poses from 2D input images. The results show that it can achieve 37 FPS on a mobile device with more than 80% AP. Unfortunately, there was no application under UAV conditions, where more complex challenges occur. Another similar approach is [34], where the authors used YoloV2 for human detection and then formed the 3D human pose and bounding box with heatmap-based regression and a 3D generation module. Unfortunately, this work has also only been tested in an indoor environment with an animated simulator. Continuing, [35] proposed a method with 2D body joint detection from multiple views with edge sensors. Exploiting a 3D human body model, bone distances, and multi-view triangulation, the 3D human pose is reconstructed. To achieve a lightweight model architecture for 2D human body joints, they exploited a MobileNetV3 backbone and a direct regression head. In addition, they quantized the model to 8-bit integer inference on a TPU edge device. Unfortunately, the test environment was indoors, with static lighting and a close distance between the human target and the sensors. In [36], the authors used multiple cameras and a transformer model that estimates human body joints with the heatmap-based method. In order to add 3D information to the transformer model, they included the camera parameters in the joint position approximation step after the 2D heatmap generation. In the end, the produced heatmaps are feature-encoded by the transformer, and finally a regression head predicts the 3D human pose. The proposed method achieved 32 ms inference time with only 5 million network parameters and a 25 mm error against the ground truth joints. As with the previous works, the tests were made in indoor conditions with no edge device usage. A revolutionary method was proposed in [37], where the authors exploited three known 3D models as backbones (PointNet, DGCNN, and Point Transformer) and fused them with two linear layers in order to estimate 3D human poses. Their input data were both 2D and 3D, achieving 26 ms inference time on a Jetson Xavier. The drawback of this approach is that the test environment was indoors, which plays a vital role because under high light variations the point cloud data might become very sparse. To summarize, 3D pose estimation approaches are still very new because they require stable lighting conditions and a comfortable distance from the target. In addition, their deployment on edge devices remains largely unexplored, with limited existing research. The scarcity of 3D data further compounds the challenges of evaluating 3D HPE methods. Notably, this paper focuses on addressing the critical aspects of performance and efficiency specifically under UAVs and edge devices, making it a challenging endeavor for 3D HPE approaches. To provide a comprehensive benchmark analysis, this study is therefore constrained to 2D data due to the complexities involved in dealing with 3D data.

3 Benchmark settings

In this section, the datasets, metrics, and algorithms of the proposed benchmark are presented. The process flow of the conducted benchmark was: dataset initialization and loading, model selection and loading, iterative model inference for each image, and metric calculation for performance and efficiency. In Fig. 5, the benchmark phases are visually presented.

Fig. 5 Benchmark process flow

The benchmark software was implemented in the Python programming language, exploiting a variety of frameworks, which are presented in Table 1.

Table 1 Benchmark frameworks and methods
Table 2 Benchmark datasets

These frameworks do not affect the comparison of the benchmarked models, since all models are fully compatible with them. Software code profilers are executed in parallel threads in order to avoid influencing the efficiency measurements; they are responsible for measuring the hardware resource consumption of software calls. The PyTorch [38] profiler is a built-in method that traces only deep learning API calls, while cProfile measures arbitrary software method calls on the CPU only.
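As a rough illustration of the flow in Fig. 5, the sketch below wraps a generic inference loop with both profilers; `model`, `images`, and the metric plumbing are hypothetical placeholders, not the actual benchmark harness:

```python
import time
import cProfile
import torch
from torch.profiler import profile, ProfilerActivity

def benchmark(model, images, device="cuda"):
    """Run inference over a dataset and collect speed and profiler traces."""
    model.eval().to(device)
    cpu_prof = cProfile.Profile()
    latencies = []
    cpu_prof.enable()
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as gpu_prof:
        with torch.no_grad():
            for img in images:
                start = time.perf_counter()
                _ = model(img.unsqueeze(0).to(device))
                torch.cuda.synchronize()          # wait for GPU work to finish
                latencies.append(time.perf_counter() - start)
    cpu_prof.disable()
    fps = len(images) / sum(latencies)
    return fps, gpu_prof.key_averages(), cpu_prof
```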

3.1 Benchmark datasets

The selected datasets for the proposed benchmark are MS-COCO [8], the Occluded Human (OCHuman) dataset [9], and the UAV-Human dataset [10]. MS-COCO [8] is a very popular benchmark dataset for a variety of CV applications such as object detection, recognition, instance segmentation, and pose estimation. It is selected because it contains multiple humans at different scales in each image, with occlusions as well. OCHuman [9] is a mainstream benchmark dataset for HPE because it contains highly occluded persons at larger scales than MS-COCO [8] and with fewer instances per image (usually two). The UAV-Human dataset [10] fulfills the proposed paper's aspects in that the images are acquired by a UAV at larger distances than the other two datasets, with only one person as the target despite the crowds that appear in some cases. In addition, the input data come from videos, and the target person's pose changes sequentially while moving, unlike in the other datasets. It was proposed for single-human tracking, and the poses are used for action recognition. In Table 2, information about the benchmark datasets is presented.

3.2 Benchmark metrics

In this section, the benchmark metrics are presented. For analyzing performance, 6 metrics are used: 3 for pose estimation accuracy and 3 for object detection through pose estimation. For efficiency measurements, GPU-CPU usage and memory consumption are included, along with the model's multiply-accumulate operations (MACs) and frames per second (FPS). In Table 3, the performance metrics are presented.

The object detection metrics are calculated from the bounding boxes generated from the ground truth and predicted human poses. More specifically, the maximum and minimum human pose joint coordinates are selected to form the corners of the bounding box that surrounds the processed person. Next, the IoU (intersection over union) index is calculated between the ground truth pose bounding box and the estimated one. If the IoU is above 90%, the estimation is considered positive. Starting with the first metric, FPR (1) stands for false-positive rate. It is a statistical metric that indicates the ratio of false-positive estimations to the total number of negative ground truth samples:

$$\begin{aligned} \textrm{FPR} = \frac{\mathrm{False\, Positive}}{\mathrm{False \,Positive} + \mathrm{True\, Negative}}. \end{aligned}$$
(1)

This metric (1) is commonly used in evaluating various classification applications. Next, sensitivity (2), or true-positive rate, is a metric that calculates the proportion of correctly detected positives among all positive ground truth samples:

$$\begin{aligned} \textrm{Sensitivity} = \frac{\mathrm{True\, Positive}}{\mathrm{True\, Positive} + \mathrm{False\, Negative}}. \end{aligned}$$
(2)

High sensitivity (2) means that the model correctly identifies the positively labeled data. Next, precision (3) calculates the proportion of true positives among all positive predictions:

$$\begin{aligned} \textrm{Precision} = \frac{\mathrm{True \,Positive}}{\mathrm{True\, Positive} + \mathrm{False\, Positive}}. \end{aligned}$$
(3)
Table 3 Performance metrics

High precision (3) means that most of the model's positive estimations match the ground truth labels across the whole dataset. With the above metrics, the performance of HPE models in detection is measured.
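As a concrete illustration, the following sketch (NumPy assumed, helper names hypothetical) derives a bounding box from pose keypoints and computes the IoU used for the 90% positive-match threshold:

```python
import numpy as np

def pose_to_bbox(joints):
    """Corner-format box (x1, y1, x2, y2) from a (J, 2) array of joints."""
    j = np.asarray(joints)
    return j[:, 0].min(), j[:, 1].min(), j[:, 0].max(), j[:, 1].max()

def iou(a, b):
    """Intersection-over-union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

pred = np.array([[110, 60], [150, 200], [130, 120]])   # toy predicted joints
gt = np.array([[100, 50], [160, 210], [128, 118]])     # toy ground truth joints
is_positive = iou(pose_to_bbox(pred), pose_to_bbox(gt)) > 0.9
```

Next, the pose estimation metrics are presented, starting from OKS (4), which stands for Object Keypoint Similarity: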

$$\begin{aligned} \textrm{OKS} = \frac{\sum _{i}\exp {\frac{-d_{i}^2}{2s^2k_{i}^2}}\delta (u_{i}>0)}{\sum _{i}\delta (u_{i}>0)}, \end{aligned}$$
(4)

where in (4) \(d_{i}\) is the Euclidean distance between the predicted joint and the ground truth joint. The term \(s\) is an image scale calculated from the image bounding box, and \(k_{i}\) is a per-joint constant that attempts to homogenize the standard deviation of each joint part. The OKS metric (4) takes values from zero to one and indicates how close the ground truth keypoints are to the predicted ones; the higher the value, the more similar the predicted and ground truth joints.
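Equation (4) maps directly to a few lines of NumPy; the sketch below assumes the common COCO convention where \(s^2\) is the person's bounding box area (an assumption, since the text only states that s is computed from the bounding box):

```python
import numpy as np

def oks(pred, gt, visibility, k, bbox_area):
    """Object Keypoint Similarity, following Eq. (4).

    pred, gt:   (J, 2) arrays of predicted/ground truth joints.
    visibility: (J,) array; a joint contributes only when > 0.
    k:          (J,) per-joint constants.
    bbox_area:  person bounding box area, used as the scale term s^2.
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)            # squared distances d_i^2
    mask = visibility > 0
    sims = np.exp(-d2[mask] / (2 * bbox_area * k[mask] ** 2))
    return sims.mean() if mask.any() else 0.0
```

Next, the PDJ metric (5), for the Percentage of Detected Joints proposed in [3], is presented: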

$$\begin{aligned} d_{j}^n = \frac{\left\| Y\textrm{pred}_{j}^n - Y\textrm{true}_{j}^n\right\| _{2}}{t^n} \end{aligned}$$
(5)

where in (5) the distance between the predicted joint \(Y\textrm{pred}_{j}^n\) and the ground truth joint \(Y\textrm{true}_{j}^n\) is calculated; n corresponds to the person index, j to the body joint, and \(t^n\) is the torso diameter. The conditions based on the predicted and ground truth joint distance are presented below:

$$\begin{aligned} \delta (j, n) = \left\{ \begin{array}{ll} 1, &{} \ d_{j}^n \le 1 \\ 0, &{} \text {otherwise} \end{array}\right. \end{aligned}$$
(6)
$$\begin{aligned} \textrm{PDJ}(j) = \sum \limits _{n=1}^{N}\frac{\delta {(j, n)}}{N} \end{aligned}$$
(7)

In (7), the mathematical expression of PDJ is presented, where N is the number of evaluated person instances. A disadvantage of this metric is that it is affected by the torso diameter, which might not be robust. Finally, PCK (9), which stands for Percentage of Correct Keypoints, is used. It is a metric proposed in [40] to overcome the above disadvantage of PDJ by replacing the torso diameter with the head segment length:

$$\begin{aligned} d_{j}^n = \frac{\left\| Y\textrm{pred}_{j}^n - Y\textrm{true}_{j}^n\right\| _{2}}{h^n} \end{aligned}$$
(8)
$$\begin{aligned} \textrm{PCK}(j) = \sum \limits _{n=1}^{N}\frac{\delta {(d_{j}^n < 0.5)}}{N} \end{aligned}$$
(9)

In (8), \(d_{j}^n\) is the joint distance normalized by the head segment length \(h^n\), n is the person index, and j is the body joint; in (9), N is the total number of evaluated person instances. The value 0.5 is a threshold on the joint distance relative to the head bone link. The above pose estimation metrics can be used in both 2D and 3D methods, with the difference that in 3D the threshold is replaced by a true distance value.
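Equations (8)–(9) translate to a short vectorized sketch (NumPy assumed; array shapes are illustrative):

```python
import numpy as np

def pck(pred, gt, head_lengths, threshold=0.5):
    """Percentage of Correct Keypoints per joint, following Eqs. (8)-(9).

    pred, gt:     (N, J, 2) arrays for N persons and J joints.
    head_lengths: (N,) head segment length per person (the h^n term).
    Returns a (J,) array with the PCK of each joint type.
    """
    # Normalized distances d_j^n, Eq. (8)
    d = np.linalg.norm(pred - gt, axis=2) / head_lengths[:, None]
    # Fraction of persons below the threshold, Eq. (9)
    return (d < threshold).mean(axis=0)
```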

Table 4 Efficiency metrics

In Table 4, frames per second (FPS) measures the number of images processed or displayed each second and indicates the processing speed. The GPU-CPU usage and memory are measured by the PyTorch [38] profilers specifically for deep learning operations. The GPU and CPU usage is calculated from the number of requests from the operating system to each unit, plus the request execution time, divided by the total process time. GPU watt consumption is measured during model inference for each image, and finally the mean value is produced. Besides device resource consumption, some model structure measurements are taken into consideration, such as MACs, which indicate the theoretical computational cost of each model.
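One way such per-inference power readings can be obtained, assuming the pynvml bindings to the NVIDIA Management Library (the exact tooling is not specified here), is:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU

def power_watts():
    """Instantaneous board power draw in watts (NVML reports milliwatts)."""
    return pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0

samples = []
for image in images:                 # hypothetical image iterator
    before = power_watts()
    _ = model(image)                 # hypothetical model from the benchmark
    samples.append(power_watts() - before)

mean_watts = sum(samples) / len(samples)   # the reported per-model figure
```

Next, the selected benchmark models are presented.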

3.3 Benchmark HPE models

For benchmark purposes, 36 different HPE models are adopted and their performance is analyzed to find the most suitable for UAV deployment. Acceleration techniques are very important in neural network inference, especially under UAV conditions. They include numerous methods [41] such as quantization, pruning, and computation graph optimizers like ONNX or TensorRT, which provide a tremendous boost in model inference speed. Unfortunately, most multi-person and some single-person model architectures do not fully support these techniques because they use customized attention mechanisms, neural network layer operations, or sub-networks, which complicates the usage of acceleration techniques; thus, most of them are not applied in this study. Single-person models usually have single-structure neural network architectures, which support acceleration techniques better. The benchmark models are used as published, and acceleration techniques are not taken into consideration by the proposed benchmark. All of the selected models use a variety of CNN backbones for feature extraction, where some are very large and some lightweight. This discrimination is made in order to highlight the importance of algorithmic efficiency and its trade-off with solution performance. In Tables 5 and 6, the selected models are presented along with their numbers of convolution, pooling, and batch normalization layers (gathered as shown in the sketch below). First, it is clear that pooling layers are not preferred for the HPE problem, which confirms that downscaling the feature map resolution leads to false localization of human joints. A second observation is that almost all models use many batch normalization layers, included in residual blocks along with ReLU between convolution layers. With this technique, the models that solve the HPE problem gain the ability to extract high-level features while preserving important information, which could be characterized as a replacement for the pooling operation.
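Such layer counts can be collected with a simple traversal of each network graph; a minimal PyTorch sketch (assuming each model loads as a `torch.nn.Module`) is:

```python
import torch.nn as nn

def count_layers(model: nn.Module):
    """Count convolution, pooling, and batch normalization layers."""
    counts = {"conv": 0, "pool": 0, "batchnorm": 0}
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            counts["conv"] += 1
        elif isinstance(m, (nn.MaxPool2d, nn.AvgPool2d)):
            counts["pool"] += 1
        elif isinstance(m, nn.BatchNorm2d):
            counts["batchnorm"] += 1
    return counts
```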

Table 5 Benchmark lightweight HPE models
Table 6 Benchmark ResNet-based HPE models

As presented in Tables 5 and 6, 22 models use ResNet as a backbone for feature extraction [3, 47, 49,50,51,52,53,54,55,56]. The ResNet architecture is characterized by feature map scaling from high to low resolution. In the HPE problem, this characteristic gives the capability to extract global and local features for human joints, which makes a model robust to scale and occlusion variations. Of the 36 models, 11 [16, 42,43,44,45,46,47] exploit lightweight backbones such as MobileNet, AlexNet, and ShuffleNet, which have small and simple CNN structures, making them suitable for balancing performance and efficiency. Finally, the YoloV5 model [48] is a direct regression approach that differs from the rest, which use sub-networks, attention mechanisms, and other approaches to estimate human body joints. Much more weight has been given to top-down approaches, which account for 28 [3, 16, 42,43,44,45,46,47, 49,50,51,52, 54,55,56] of the 36 models, because their performance and efficiency are not constrained by the number of humans in the input image. Depending on the application, top-down methods can be used for both single- and multiple-person pose estimation.

4 Benchmark results and discussion

In this section, the benchmark results are presented, starting with the performance metrics in problem-solving and concluding with the efficiency measurements. The top three models with the highest scores are marked in the performance results.

Table 7 MS-COCO dataset performance benchmark results

In a general view, all models achieve nearly 80% accuracy in all three pose estimation metrics. Considering the object detection metrics, almost all estimated poses are not far in pixel distance from the ground truth joints. The three best-performing models for MS-COCO [8] are HigherNet-50 [53] in first place with 83%, 88%, and 90% in the three pose estimation metrics; MobileNetV1-AE [45] in second place, at a 5–10% difference, with 77%, 82%, and 85% for the same metrics; and ResNet-101-AE [54] in third place, at a 1–4% difference, with 76%, 79%, and 81%. Another observation is that the lightweight models have performance similar to the ResNet-based models. In Table 8, the performance metrics for the OCHuman dataset are presented.

Table 8 OCHuman dataset performance benchmark results

In general, all models perform poorly on the OCHuman [9] dataset; the cause is the high level of occlusion between humans, with their clothing variations on top of that. In addition, all of the above models are trained on MS-COCO [8], which does not share the same level of difficulty. Most of the ResNet-based models perform better than the lightweight ones, with a 1–5% difference in all three pose estimation metrics. The top three models on this dataset are DiteHrnet-18 [51] in first place with 64%, 69%, and 71% in the pose estimation metrics; VGG19-PAF [16] in second place with 64%, 68%, and 69%, with ShuffleNetV1-PAF [16] close behind; and 2xMSPN-50 [49] in third place with 64%, 67%, and 69% pose estimation accuracy. In Table 9, the performance metrics for the UAV-Human dataset are presented.

Table 9 UAV-Human dataset performance benchmark results

In Table 9, all models achieve higher performance scores on the UAV-Human dataset [10] compared to the previous datasets, because each image has only a single pose estimation target, while the other two have multiple. Again, the ResNet-based models have the highest performance, indicating that they can extract more robust features even in small image scale conditions. First place goes to ResNet-50-DCN [55], which achieved 94%, 95%, and 99% in the three pose estimation metrics. In second place is ResNet-50-AE [54] with 89%, 91%, and 95%. In third place is ResNet-101-AE [54] with 91%, 92%, and 96%. Despite the ResNet-based lead, the lightweight models are very close in performance even with their smaller structures, with MobileNetV2-AE [45] and HRNet-L [43] within 1–2% of the top three models in the PCK metric and 6–10% in the other two pose estimation metrics.

In all the above tables, the ResNet-based models hold the first places because, based on the performance results, they handle the datasets' difficulties better. However, it is worth mentioning that the lightweight models are very close. The chosen datasets encompass a comprehensive range of challenging detection conditions, including scale variations and occlusions when viewed from an aerial camera perspective. In particular, the MS-COCO and OCHuman datasets comprise images depicting diverse human scales alongside significant occlusions. Moreover, the UAV-Human dataset incorporates variations in scale, specifically featuring images with small-scale human subjects. Consequently, evaluating the models' performance on each dataset can effectively showcase their robustness under specific conditions. It is worth noting that conventional neural network operations do not inherently offer scale, rotation, or occlusion invariance; rather, these characteristics are acquired through the training data, which enables the models to exhibit robustness against such degradation factors. Considering the performance metrics, it becomes evident that occlusions pose a more significant challenge than scale variations or the specific perspective of UAV camera imagery. CNNs enhanced with mechanisms such as AE (associative embedding) or PRM have emerged as the top-performing models across all three datasets, clearly demonstrating their resilience against the degradation factors present in these datasets. By leveraging AE, CNNs exhibit improved grouping of individual person joints, leading to enhanced matching of identical joints and better differentiation among multiple persons' joints. As a result, this mechanism effectively provides occlusion invariance, addressing the challenges associated with occlusions in the HPE problem. By employing PRM, a fusion of local and global features is achieved through a reweighting process applied to the generated feature map. This results in a more balanced feature map, which significantly improves pose estimation accuracy. Notably, this attention mechanism plays a crucial role in enhancing the model's robustness against both scale variations and occlusions. The incorporation of custom neural network layers and attention mechanisms, as evident from the performance results, effectively supports the models' resilience in overcoming the aforementioned challenges. Concluding the performance benchmark presentation, it remains unclear which model is best suited for deployment on edge devices. To further explore this aspect, measurements of model efficiency are provided and analyzed. Within each dataset, the top three models are identified based on the highest FPS and the lowest resource consumption. For the experiments, an Nvidia RTX 2060 GPU with 8 GB VRAM and an AMD Threadripper 2920X 12-core CPU were utilized.

Table 10 MS-COCO dataset efficiency benchmark results of lightweight models
Table 11 MS-COCO dataset efficiency benchmark results of ResNet-based models

Considering both Tables 10 and 11, it is clear that the lightweight models take first place because they maintain low GPU usage and memory while achieving high FPS. A balanced model between the ResNet-based and lightweight groups is VGG16-PAF, which achieves high FPS at a friendlier computational cost. Based on both tables, the top three models are in Table 10; they have the highest speed and the lowest hardware consumption compared with the rest of the models. It is clear that the ResNet-based models are computationally heavy, leaving no room for additional models to be deployed to extract higher-level information about the human targets in the image content. Including the performance metrics verifies that it is very challenging for a model to optimally balance performance and efficiency. Compared with the performance metrics, the efficiency results show a larger gap between the lightweight models and the rest. This is reflected in FPS and MACs, where most lightweight models differ from the ResNet-based models by more than 9 FPS and 1.0 GMAC. Beyond these general observations, the most efficient model for this dataset is AlexNet-CPM [42], with very low GPU usage and memory and high FPS. The same applies to ShuffleNetV1-PAF [46] and VIPNAS with a MobileNetV3 backbone [47]. Very close to these models is MobileNetV1-AE [45].

Table 12 OCHuman dataset efficiency benchmark results

Analyzing the results in Table 12, the FPS dropped significantly in comparison with the previous dataset. This is because the OCHuman dataset [9] has a higher image resolution (1280x720) than MS-COCO [8], which increases GPU and CPU resource consumption. From the models' perspective, the ResNet-based models once again require a huge amount of resources in comparison with the lightweight models, which require less than 100 MB of GPU memory. The lightweight models hold first place, with AlexNet-CPM [42] achieving almost 20 FPS with very low GPU usage and memory. In second place is MobileNetV3-PAF [44], which has a higher FPS than the previous model but a higher computational cost. Finally, VIPNAS [47] with MobileNetV3 as a backbone is in third place.

Table 13 UAV dataset efficiency benchmark results

In Table 13, the efficiency measurements for the UAV-Human dataset [10] are similar to those for OCHuman [9] because the two share the same image resolution. As before, the lightweight models are the friendliest to the device's computational resources while having the highest speed. Once again, first place goes to AlexNet-CPM [42] with high speed and low device resource consumption, while second and third place go to MobileNetV3-PAF [44] and VIPNAS [47] with MobileNetV3 as a backbone. To close the efficiency analysis, the GPU watt consumption of the HPE models is presented in Fig. 6. These watt measurements are taken as the difference in GPU watt consumption before and after model inference for each image; finally, the mean over all measurements is calculated.

Fig. 6 Benchmark models watt consumption

From Fig. 6, it is clear that the most efficient models also have the lowest watt consumption, which characterizes them as computationally friendly and lightweight. Next, visual results of the best-performing models are presented, with an image sample for each dataset.

In Fig. 7, the ground truth for the COCO dataset is presented, where the human pose has a very small scale in a complex environment relative to the rest of the image. In addition, some human body parts, such as the left leg, are not visible. From the results, it is clear that scale affects the performance of the models, since none of the predictions is as tidy as the ground truth. In general, the results are of similar quality for both lightweight and ResNet-based models. This similarity may arise from the fact that almost all 12 models share common modifications such as AE or DCN, attention-style mechanisms that have boosted HPE estimations, as mentioned and analyzed in the previous section. In Fig. 8, the ground truth and the estimations of the 12 best-performing models for the OCHuman dataset are presented. In this test image, the occlusions between humans are very high, and the body parts have very complex positions and orientations. Based on the models' estimations, the ResNet-based models perform better since they have a larger feature extractor than the lightweight models. All ResNet-based models predict both football players from the ground truth, while the lightweight models predicted only the front one. In a more general view, all models show unsatisfactory localization of the human body points in comparison with the ground truth.

As before, the UAV-Human dataset results are presented in Fig. 9. In this dataset, the target is to identify the human's action through the human pose in a video sequence. In the specific sample, the human pose points are very close to each other and at a far distance. Based on the models' results, it is clear that all models perform similarly despite their differences in architecture. Considering the rest of the visual results, it seems that even a small model architecture can achieve robust results in normal cases of scale and occlusion, given the appropriate modifications and additions (attention mechanisms, custom layers). Taking into account the performance metric results presented previously, MobileNetV3-PAF [44] is the most balanced between performance and efficiency, since it brings performance metrics very close to the top ones. Excluding the MS-COCO [8] and OCHuman [9] datasets from the scope of analysis and focusing on the UAV-Human dataset [10], more models could be characterized as candidates for an equal trade-off between performance and efficiency, such as MobileNetV2-AE [45], HRNet-L [43], and VIPNAS [47], which also places among the first in the efficiency metrics. Considering both the performance and efficiency results, the most suitable models for UAV deployment are MobileNetV2-AE [45], HRNet-L [43], VIPNAS [47], and MobileNetV3-PAF [44]. Closing this section, the most balanced model and the other three mentioned above are selected for the analysis presented in the next section.

Fig. 7 Visual results of the 12 best-performing models on the COCO dataset

Fig. 8 Visual results of the 12 best-performing models on the OCHuman dataset

Fig. 9 Visual results of the 12 best-performing models on the UAV-Human dataset

5 Benchmark results projection in edge device specifications

In this section, the efficiency measurements are projected onto edge device hardware specifications, because the presented benchmark was conducted on a desktop machine with specifications quite far from those of an edge device. The selected edge devices are the Jetson TX2 NX and Xavier NX, because they are the only ARM-based devices that combine a strong GPU with low energy consumption. Between these two edge devices, the Xavier has hardware specifications near those of a common computer, while the TX2 has more limited computational resources. In Table 14, the hardware specifications of the benchmark machine and the above-mentioned boards are presented.

Table 14 Benchmark machine and edge device hardware specifications

Comparing the above devices, the desktop GPU can achieve 2 times the TFLOPs of the Jetson Xavier and 30 times those of the Jetson TX2. Between the Jetson devices, the Xavier can achieve 15 times the TFLOPs of the TX2. From the GPU perspective, the desktop machine has 7 times more CUDA cores than the Xavier and 8.5 times more than the TX2. Between the Jetson devices, the Xavier has 1.5 times more CUDA cores than the TX2.

In order to estimate the selected models' efficiency, two known metrics are used: code balance and machine balance [57], together referred to as balance analysis. In addition, the performance ratio is presented, which indicates the software's efficiency. In balance analysis, the included metrics measure the software code's performance under a specific device's computational capacities based on its resources.

$$\begin{aligned} B_{m} = \frac{b_\textrm{max}}{P_\textrm{max}} \end{aligned}$$
(10)
$$\begin{aligned} b_\textrm{max} = \frac{\textrm{MemoryBandwidth}}{\mathrm{bytes/word}} \end{aligned}$$
(11)

In (10), the term \(b_\textrm{max}\) is the device memory bandwidth divided by the bytes per word that the device can achieve, as presented in (11). The term \(P_\textrm{max}\) is the peak floating-point performance of the device. The unit of measurement for \(B_{m}\) is words/flop. More specifically, \(B_{m}\) is the ratio of memory operations per CPU or GPU cycle to the number of floating-point operations in the same processor. Accordingly, code balance is the ratio between the memory traffic of a software code and the flops it produces. The mathematical formula of code balance is presented below.

$$\begin{aligned} B_{c} = \frac{\textrm{DataTraffic}}{\textrm{codeFLOPs}} \end{aligned}$$
(12)

In (12), the DataTraffic term can be estimated as the sum of data loads and data stores in variables. For CNNs, this is the sum of the loaded input data and the stored inference data. Finally, this result is divided by the model MACs in order to compute the CNN model's code balance \(B_{c}\). Between the two metrics (10) and (12) holds the relationship that if \(B_{m}\) is smaller than \(B_{c}\), then the software code is not efficient for the candidate device. It is worth mentioning that these metrics are not free of systematic bias, because they do not include cache memory reads or latencies from memory operations. The performance ratio is an indicator calculated by dividing the memory bandwidth by the code balance, where the result is compared with the peak performance of the device.

$$\begin{aligned} P = \min {\left( P_\textrm{max}, \frac{b_\textrm{max}}{B_{c}}\right) } \end{aligned}$$
(13)

From (13), the result indicates the maximum achievable performance. The fraction of \(b_\textrm{max}\) and \(B_{c}\) estimates the required performance in flops. If the result equals \(P_\textrm{max}\), then the code saturates the whole device's performance, which is not efficient because it requires full power consumption. In addition, the fraction \(\frac{b_\textrm{max}}{B_{c}}\) indicates the fraction of the device's performance that is required.
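For clarity, the balance analysis of Eqs. (10)–(13) reduces to a few lines of arithmetic; the numbers below are illustrative placeholders, not the measured values of Table 15:

```python
def machine_balance(mem_bandwidth_gbs, bytes_per_word, peak_gflops):
    """B_m = b_max / P_max, Eqs. (10)-(11); unit: words/flop."""
    b_max = mem_bandwidth_gbs / bytes_per_word   # Gwords/s
    return b_max / peak_gflops, b_max

def code_balance(data_traffic_gwords, model_gmacs):
    """B_c = DataTraffic / codeFLOPs, Eq. (12)."""
    return data_traffic_gwords / model_gmacs

def attainable_performance(peak_gflops, b_max, b_c):
    """P = min(P_max, b_max / B_c), Eq. (13)."""
    return min(peak_gflops, b_max / b_c)

# Illustrative values only, roughly TX2-class hardware
b_m, b_max = machine_balance(mem_bandwidth_gbs=59.7, bytes_per_word=4,
                             peak_gflops=1330.0)
b_c = code_balance(data_traffic_gwords=0.5, model_gmacs=4.0)
p = attainable_performance(1330.0, b_max, b_c)
print(f"B_m={b_m:.4f} words/flop, B_c={b_c:.3f}, attainable P={p:.1f} GFLOPs")
# The code fits the device when B_c <= B_m; p / peak gives the performance ratio
```

In Table 15, the above metrics for the presented devices and models are given.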

Table 15 \(B_{c}\) values of the selected models

Analyzing the values in Tables 14 and 15, it is clear that all models can be executed sufficiently well on the above edge devices, since their \(B_{c}\) is smaller than the hardware machine balance \(B_{m}\). In addition, the models' memory consumption fits comfortably on both Jetson devices. The performance ratio of the models' achievable performance is 10–20% of the peak performance for the Jetson TX2 and, accordingly, 2–6% for the Jetson Xavier. More specifically, MobileNetV3-PAF [44] and HRNet-L [43] have the higher performance ratios since they have the lowest \(B_{c}\) compared to the other models. From the above analysis, it is clear that, at a theoretical level, all indicators show that the above 4 models are sufficient for the selected edge devices with respect to efficiency and solution performance. Based on \(B_{c}\) and the performance ratio, MobileNetV3-PAF [44] is the best candidate for both edge devices. Considering the benchmark results as well, the MobileNetV3-PAF [44] model fulfills the paper's aspects.

6 Conclusion

The proposed paper analyzes and reviews HPE methods and presents a benchmark with multiple aspects. The review showed that proposed 2D HPE methods gained significant momentum on the problem-solving challenges twice: once with the proposition of CNNs, and again when 3D HPE methods gained interest from the CV community. Unfortunately, from the aspect of efficiency and UAV deployment, there are few robust proposed methods for 2D and even fewer for 3D data. Based on the benchmark results and the edge device projection, lightweight models with a MobileNet backbone have the most optimal balance between performance and efficiency, with MobileNetV3-PAF [44] the most suitable among them.

Despite the results, it has proven difficult to design an efficient HPE method that performs robustly on UAVs, even with CNNs. Of the 36 models, only 4 were candidates, and finally only 1 is appropriate for achieving high performance with low resource consumption without stressing the edge device, based on the projection. More broadly, very few models deal with active-vision HPE under edge device deployment on UAVs. This fact can be attributed to the mainstream benchmark datasets used for development, which include non-sequential images with random conditions and close distances. Another contributing factor is the usage of very large CNN backbones for feature extraction, which is computationally intense. A future direction may include the development of a lightweight method that exploits human pose motion information in order to predict and detect human poses in the next frame, rather than applying the same method to each frame. This might boost human detection and estimation in sequential images and enable the next generation of human trackers. Another future direction of the presented study is a benchmark of 3D HPE models under the aspects of performance and efficiency, in order to support the conclusions with numerical results.