1 Introduction

Human pose estimation (HPE) is one of the most fundamental topics in the computer vision (CV) research community, since many higher-level applications benefit from it. Among the first methods applied to this problem were the part-based model [1] and the flexible mixture of parts [2]. Both methods use template matching or local detectors to localize body parts in RGB images and then correlate them in order to find the human pose joints. Unfortunately, these methods require hand-crafted features such as color, edges, or descriptors from signal transformations, leading to computationally intense solutions with low performance in real environment conditions. The adoption of convolutional neural networks (CNNs) in HPE in 2014 [3] gave a significant boost to solution performance due to their large generalization capability and automated feature extraction. The most recent advance around this topic is 3D HPE, where one of the first and most mature methods appeared in 2017 [4]. HPE methods can be categorized into single-person and multiple-person pose estimation. Figure 1 shows a visual representation of these two approaches along with their sub-categories.

Fig. 1 HPE methods variations

Single-person pose estimation solves the problem of predicting human body joints when the input image contains only a single person. In multiple-human scenarios, an additional method is required to crop the image into sub-images of individual persons. Multiple-person pose estimation localizes body joints in an image with multiple persons. In this category, top-down approaches are similar to single-person ones because they first detect all persons in an image and then predict the pose of each detected human. The bottom-up approach is less constrained: in a given image with multiple persons, human body keypoints are estimated first and then grouped to the person they belong to. Multiple-person techniques, and especially top-down ones, are preferred for UAV deployment because the operating environment is usually crowded. From the data representation perspective, there are three main data structures for the human body pose. The kinematic model [5] is the most mainstream structure, where body joints are represented as points. An advantage of this model is its low computational cost; a disadvantage is that it does not contain any texture or shape features about the human body. The planar model, as used in [6], is a scheme where each body part is represented as a rectangle in order to form the human body. An advantage of this representation is the addition of shape information about the human body contour; a disadvantage is that in high-occlusion situations its performance drops significantly. Finally, the volumetric model, as presented in [7], is a human body representation built from 3D data, where geometric shapes and body meshes are combined to produce the richest data structure of the human body. Under this model, each point of the human body can have 1–3 degrees of freedom (DOF). Its advantage is the rich information about the human body pose; its disadvantage is the high computational cost. To sum up, the diversity of HPE techniques and representations is covered at an abstract level. The aims of this work are: (1) to highlight the evolution of HPE techniques (with a focus on CNN-based ones), (2) to present HPE challenges from the perspectives of problem-solving performance and algorithmic computational footprint, and (3) to present benchmark results of 36 2D CNN-based models on 3 well-known datasets: MS-COCO [8], OCHuman [9], and UAV-Human [10]. The benchmark is limited to 2D data because of the scarcity of 3D data under UAV conditions with complex scales and occlusions. The contributions of this work are:

  1. An exhaustive benchmark of HPE algorithms under the scope of performance and efficiency in UAV operation.

  2. For the first time, a computational projection onto numerous edge device specifications is conducted for HPE algorithms.

This study is primarily motivated by the necessity to develop UAV-based systems for cinematography, with the objective of utilizing HPE computer vision algorithms as part of a research project.

The structure of this paper is as follows: Sect. 2 presents related work on HPE techniques based on the classification made above, along with their advantages and disadvantages in terms of performance and efficiency. In the same section, a literature analysis is presented to highlight the evolution of HPE techniques over the past 22 years and to correlate them with known challenges. Section 3 presents the benchmark properties, algorithms, metrics, and datasets along with the conducted process scheme. Section 4 analyzes the benchmark results under the above-mentioned aspects in order to highlight the algorithms most balanced between these two concepts. Section 5 presents a projection of the benchmark efficiency results onto edge computing hardware specifications, in order to highlight the most suitable devices for this situation. Finally, in Sect. 6, conclusions are presented and a discussion is made around HPE algorithms.

2 Related work

In this section, related work is presented based on the variations of HPE methods, with emphasis on CNN-based approaches. Common challenges faced by all HPE methods are human occlusions and scale and pose variations, especially in crowded images. More specifically for UAVs, fast movements cause motion blur in the camera image and noise from rapid lighting changes. In addition, a large distance from the target makes the human scale smaller and occlusions a more difficult challenge.

2.1 Single-person HPE

Starting with single-person pose estimation, better-performing solutions exist here than in multiple-person pose estimation because no extra method is required to isolate individual persons. In comparison, multiple-person methods are more computationally intensive because few all-in-one models produce human bounding boxes and poses at the same time; hence, the cooperation of two different models is required, which raises hardware resource consumption. In addition, under this approach the pose estimation model is highly constrained by the object detection performance. Single-person approaches typically solve a regression problem. In direct regression, joint coordinates are regressed directly from a single feature map. This technique is the simplest and most efficient approach, but it has drawbacks such as low performance in modeling the relationships between human body joints. To overcome this challenge, the authors of [11] proposed a method that calculates prediction errors in order to refine the predicted poses. Another challenge in this approach is the pooling operation of CNN models, which destroys the relationships and locality of joints. The heatmap-based technique instead produces a heatmap for each human body joint in order to predict joint locations [12, 13]. Comparing the two techniques, direct regression has difficulty mapping image content to a human pose and often requires much higher image resolutions. On the other hand, heatmap-based approaches require heuristic heatmap manipulation in order to form the final pose, which makes them more complex. From the efficiency perspective, direct regression approaches are easier to deploy because they have a single structure and can easily be optimized with known inference engines, in contrast to heatmap-based models, which require custom plugins to achieve inference optimization. From the UAV perspective, direct regression can achieve better efficiency but worse performance considering the challenges that might occur, whereas heatmap-based methods might be preferable since they can better handle challenges such as occlusions and scales.
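To make the distinction concrete, the following minimal sketch (NumPy assumed, helper name hypothetical) shows the decoding step that heatmap-based methods require and direct regression avoids: each joint location is recovered as the peak of its heatmap.

```python
import numpy as np

def decode_heatmaps(heatmaps, stride=4):
    """Recover one (x, y, confidence) keypoint per joint heatmap.

    heatmaps: array of shape (num_joints, H, W), one heatmap per joint.
    stride:   downscale factor between input image and heatmap resolution.
    """
    keypoints = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)  # peak = joint location
        keypoints.append((int(x) * stride, int(y) * stride, float(hm[y, x])))
    return keypoints

# Toy usage: 17 joints (COCO convention) on a 64x48 heatmap grid
heatmaps = np.random.rand(17, 64, 48)
print(decode_heatmaps(heatmaps)[:3])
```

Real systems refine the peak with sub-pixel offsets, but this argmax step is the heuristic manipulation referred to above.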

2.2 Multiple-person HPE

Multiple-person pose estimation is more challenging since its computational cost scales with the number of persons. Under this category there exist two common approaches: top-down and bottom-up methods. Top-down methods are similar to single-person ones and thus share the same disadvantages regarding efficiency and deployment challenges. A proposed implementation that overcomes this challenge is Mask-RCNN [14], which can predict human bounding boxes and human joints from the same backbone feature map. Another interesting method is HRNet [15], where down-sampling and up-sampling sub-networks generate low- and high-level features from input images, leading to better human pose estimations. The bottom-up approach discards the object detection constraint because it detects all human joints in an image and then groups them per person. One of the breakthroughs in this direction is OpenPose [16], which uses a nonparametric representation of human body joints called Part Affinity Fields (PAFs). Under this approach, the feature encoding includes both joint position and orientation, creating a confidence map that proved more robust in the grouping procedure. Another interesting proposition came in [17], where a method called associative embedding enables supervised convolutional networks to perform detection and grouping. Under this approach, two kinds of heatmaps are produced simultaneously, one for human joint detection and one for joint grouping. Both of these methods have advantages and disadvantages. Bottom-up methods can distinguish human poses even in high-occlusion situations, while top-down methods perform worse there. On the other hand, bottom-up methods often use very large network structures in order to achieve highly robust features and estimations, which makes them computationally costly, especially for edge device deployment. Finally, top-down methods are the most suitable for UAV deployment because they emphasize person detection as well; thus, they can handle the challenges better than bottom-up methods, which might have difficulties with small scales and high occlusions.
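To illustrate the grouping step that bottom-up methods must solve, the sketch below is a deliberately simplified, greedy variant of tag-based grouping in the spirit of associative embedding, not the exact procedure of [17]:

```python
import numpy as np

def group_by_tags(detections, tag_threshold=1.0):
    """Greedily assign joint detections to persons by embedding-tag distance.

    detections: list over joint types; each entry is a list of
                (x, y, score, tag) tuples detected for that joint type.
    Returns a list of persons, each a dict {joint_type: (x, y, score)}.
    """
    persons, person_tags = [], []
    for joint_type, candidates in enumerate(detections):
        for x, y, score, tag in candidates:
            # Find the existing person whose mean tag is closest
            best, best_dist = None, float("inf")
            for i, tags in enumerate(person_tags):
                d = abs(tag - np.mean(tags))
                if d < best_dist:
                    best, best_dist = i, d
            if best is not None and best_dist < tag_threshold:
                persons[best][joint_type] = (x, y, score)
                person_tags[best].append(tag)
            else:  # tag too far from every group: start a new person
                persons.append({joint_type: (x, y, score)})
                person_tags.append([tag])
    return persons
```

The published method learns the tags jointly with the detection heatmaps and uses a more careful matching; the sketch only conveys why grouping adds cost and failure modes that top-down methods avoid.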

2.3 Literature analysis

In many scientific papers, books, and journals, the HPE problem is mentioned as one of the hottest topics in the CV community. In this section, a presentation and analysis are made based on data retrieved from the Publish or Perish software [18] with Google Scholar as the source. The aims of this section are to highlight the significance of HPE and the proposed works, and to relate these works to the aspects of the proposed paper. In addition, an analysis is made from the perspective of 3D HPE as a new approach to this problem. The searched keywords were 2D HPE AND 3D HPE. In Fig. 2, the published papers for the examined time range are presented for 2D and 3D accordingly.

Fig. 2 2D (red) and 3D (blue) HPE published papers by year

Based on Fig. 2, it is clear that HPE gained an uptrend very close to the appearance of CNNs for this problem. In particular, in recent years 3D methods have drawn growing attention, and the contribution around this topic is higher. For a deeper analysis, Fig. 3 presents the citations of the published scientific papers for each year and for each keyword.

Fig. 3 2D (red) and 3D (blue) HPE published paper citations by year

As mentioned above, 2014 was the year in which a CNN-based method was first proposed for 2D HPE, and according to Fig. 3, the citations are very high relative to the number of papers published that year. This means that in 2014 the groundwork was laid for the maturation of CNN-based methods, which is justified by the citations produced in the following years around 2D HPE. From 2020 onward, citations decrease because the published papers are very recent. Similar to 2D HPE, the 3D approach had a significant impact in 2017, which is approximately the year when 3D HPE began to be solved by CNN models together with 3D sensors such as the Kinect. Comparing the two figures, it can be noted that 2D HPE also experienced a significant impact from the appearance of 3D methods. Based on Fig. 3, it is clear that 3D is a very new approach with some limitations, as will be presented next, while 2D HPE, in contrast, has started to become entrenched. With appropriate Python code, the collected papers are analyzed by searching for the keywords UAVs, drones, lightweight, and edge devices in their titles and abstracts. In Fig. 4, the published papers that include the above keywords for 2D and 3D HPE are presented accordingly.
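A minimal sketch of this filtering step is given below, assuming the collected metadata were exported to a CSV file with `Title` and `Abstract` columns (hypothetical file and column names, not the exact script used for Fig. 4):

```python
import csv

KEYWORDS = ("uav", "drone", "lightweight", "edge device")

def matches(record):
    """True if any keyword appears in the paper's title or abstract."""
    text = (record.get("Title", "") + " " + record.get("Abstract", "")).lower()
    return any(kw in text for kw in KEYWORDS)

with open("hpe_papers.csv", newline="", encoding="utf-8") as f:
    hits = [row for row in csv.DictReader(f) if matches(row)]

print(f"{len(hits)} papers mention UAV/edge-efficiency keywords")
```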

Fig. 4 2D (red) and 3D (blue) HPE published papers with efficiency aspects

UAVs are a very new and challenging platform for CV applications around humans, such as surveillance, data mining, action recognition, threat detection, and object tracking. In all these applications, pose estimation is a vital step in the extraction of robust high-level information. An efficiency challenge in this kind of system setup is that the CV algorithms need to be deployed on edge devices with limited hardware resources in order to be power-friendly, since UAVs usually run on batteries. In addition, the input images have more complex content, with scale variations, motion blur, occlusions, and lighting changes. HPE methods developed under these conditions need to balance performance and efficiency in order to be appropriate for real environment situations under the scope of UAV deployment and application. Based on Fig. 4, only a few published papers deal with these challenges, and it appears to be a very complex problem: there are only 28 papers for 2D HPE and 10 for 3D HPE. Several reasons can justify this scarcity: high 3D sensor noise in outdoor environments, and the computational intensity of 3D HPE approaches, which extract more information than 2D ones or include a 3D reconstruction process in order to find the human body joints.

2.4 2D HPE literature analysis

In [19], the authors proposed a pictorial method for multiple-person HPE detection and classification. Their performance measurements show that for 15 persons the algorithm required 1.5 min of processing and the overall accuracy was 71%. In [20], a lightweight pose estimation model (LPE) is proposed, which in its smallest form has 1 GFLOP of operations and achieved 17 frames per second (FPS) with 67% average precision (AP). To achieve this, they exploited an attention mechanism that captures local pixel-level relationships of human joints in the image context. In [21], another CNN-based approach is proposed, which achieved 85% AP on the COCO benchmark dataset [8] with only 4 million parameters. To achieve this, the authors incorporated geometrical and structural information from the input data. Unfortunately, there is no reference to speed, and the test environment is not a UAV one. Another interesting method, proposed in [22], used two pyramid attention networks to achieve multi-scale feature fusion for better feature representation in HPE, with 1.5 million network parameters, achieving up to 80% AP on known datasets. Like the previous work, the test environments do not involve UAVs or edge devices. A very representative method proposed in [23] exploits a network architecture optimization that includes an encoder–decoder backbone with deconvolution layers. The authors note that optimizing the deconvolution layers decreases the computational cost while keeping accuracy stable. The achieved results were up to 90% accuracy and 60 FPS on a Jetson AGX board. Similar to LPE, in [24] the authors used an attention mechanism and the ghost module found in known models such as MobileNet and ShuffleNet. The results show that their model achieved up to 90% with only 1.7 million parameters. Continuing with attention-mechanism approaches, in [25] the authors used HRNet [15] as a backbone and a transformer module to apply feature encoding before a regression head. With this addition, the authors achieved an 86% reduction in network parameters, but without any reference to UAVs or edge device deployment. At a more general level, the 2D HPE approaches proposed from the perspective of efficiency and performance exploit attention mechanisms that enrich the feature representation and decrease the number of network parameters [26,27,28,29,30,31,32]. Unfortunately, none of these papers includes a UAV application environment, which might change the performance metrics; their main scope of interest is only lightweight solutions, which fulfill the efficiency aspect and are measured only on desktop computer machines. Of the above-presented papers, only 1 predates 2019, while the rest come after, which is consistent with Fig. 4.

2.5 3D HPE literature analysis

For the 3D approach to HPE, an interesting proposition was made in [33]. The authors proposed a lightweight model with a MobileNet backbone in order to predict 3D human poses from 2D input images. The results show that it can achieve 37 FPS on a mobile device with more than 80% AP. Unfortunately, there was no application under UAV conditions, where more complex challenges occur. Another similar approach is [34], where the authors used YoloV2 for human detection and then formed the 3D human pose and bounding box with heatmap-based regression and a 3D generation module. Unfortunately, this work has also only been tested in an indoor environment with an animated simulator. Continuing, [35] proposed a method with 2D body joint detection from multiple views with edge sensors. Exploiting a 3D human body model, bone distances, and multi-view triangulation, the 3D human pose is reconstructed. To achieve a lightweight model architecture for 2D human body joints, they exploited a MobileNetV3 backbone and a direct regression head. In addition, they quantized the model to 8-bit integer inference on a TPU edge device. Unfortunately, the test environment was indoors, with static lighting and a close distance between the human target and the sensors. In [36], the authors used multiple cameras and a transformer model that estimates human body joints with the heatmap-based method. In order to add 3D information to the transformer model, they included the camera parameters in the joint position approximation step after the 2D heatmap generation. In the end, the produced heatmaps are feature-encoded by the transformer, and finally a regression head predicts the 3D human pose. The proposed method achieved 32 ms inference time with only 5 million network parameters and a 25 mm error against the ground truth joints. As with the previous works, the tests were made in indoor conditions with no edge device usage. A revolutionary method was proposed in [37], where the authors exploited three known 3D models as backbones (PointNet, DGCNN, and Point Transformer) and fused them with two linear layers in order to estimate 3D human poses. Their input data were both 2D and 3D, achieving 26 ms inference time on a Jetson Xavier. The drawback of this approach is that the test environment was indoors, which plays a vital role because under high light variations the point cloud data might become very sparse. To summarize, 3D pose estimation approaches are still very new because they require stable lighting conditions and a comfortable distance from the target. In addition, their deployment on edge devices remains largely unexplored, with limited existing research. The scarcity of 3D data further compounds the challenges of evaluating 3D HPE methods. Notably, this paper focuses on addressing the critical aspects of performance and efficiency specifically under UAVs and edge devices, making it a challenging endeavor for 3D HPE approaches. To provide a comprehensive benchmark analysis, this study is therefore constrained to 2D data due to the complexities involved in dealing with 3D data.

3 Benchmark settings

In this section, the datasets, metrics, and algorithms of the proposed benchmark are presented. The process flow of the conducted benchmark was: dataset initialization and loading, model selection and loading, iterative model inference for each image, and metric calculation for performance and efficiency. In Fig. 5, the benchmark phases are visually presented.

Fig. 5 Benchmark process flow

The benchmark software was implemented in the Python programming language, exploiting a variety of frameworks, which are presented in Table 1.

Table 1 Benchmark frameworks and methods
Table 2 Benchmark datasets

These frameworks do not affect the comparison of the benchmarked models, since all models are fully compatible with them. Software code profilers are executed in parallel threads in order to avoid influencing the efficiency measurements; they are responsible for measuring the hardware resource consumption of software calls. The PyTorch [38] profiler is a built-in method that traces only deep learning API calls, while cProfile measures arbitrary software method calls on the CPU only.
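As a rough illustration of the flow in Fig. 5, the sketch below wraps a generic inference loop with both profilers; `model`, `images`, and the metric plumbing are hypothetical placeholders, not the actual benchmark harness:

```python
import time
import cProfile
import torch
from torch.profiler import profile, ProfilerActivity

def benchmark(model, images, device="cuda"):
    """Run inference over a dataset and collect speed and profiler traces."""
    model.eval().to(device)
    cpu_prof = cProfile.Profile()
    latencies = []
    cpu_prof.enable()
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as gpu_prof:
        with torch.no_grad():
            for img in images:
                start = time.perf_counter()
                _ = model(img.unsqueeze(0).to(device))
                torch.cuda.synchronize()          # wait for GPU work to finish
                latencies.append(time.perf_counter() - start)
    cpu_prof.disable()
    fps = len(images) / sum(latencies)
    return fps, gpu_prof.key_averages(), cpu_prof
```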

3.1 Benchmark datasets

The selected datasets for the proposed benchmark are MS-COCO [8], the Occluded Human (OCHuman) dataset [9], and the UAV-Human dataset [10]. MS-COCO [8] is a very popular benchmark dataset for a variety of CV applications such as object detection, recognition, instance segmentation, and pose estimation. It is selected because it contains multiple humans at different scales in each image, with occlusions as well. OCHuman [9] is a mainstream benchmark dataset for HPE because it contains highly occluded persons at larger scales than MS-COCO [8] and with fewer instances per image (usually two). The UAV-Human dataset [10] fulfills the proposed paper's aspects in that the images are acquired by a UAV at larger distances than the other two datasets, with only one person as the target despite the crowds that appear in some cases. In addition, the input data come from videos, and the target person's pose changes sequentially while moving, unlike in the other datasets. It was proposed for single-human tracking, and the poses are used for action recognition. In Table 2, information about the benchmark datasets is presented.

3.2 Benchmark metrics

In this section, the benchmark metrics are presented. For analyzing performance, 6 metrics are used: 3 for pose estimation accuracy and 3 for object detection through pose estimation. For efficiency measurements, GPU-CPU usage and memory consumption are included, along with the model's multiply-accumulate operations (MACs) and frames per second (FPS). In Table 3, the performance metrics are presented.

The object detection metrics are calculated from the bounding boxes generated from the ground truth and predicted human poses. More specifically, the maximum and minimum human pose joint coordinates are selected to form the corners of the bounding box that surrounds the processed person. Next, the IoU (intersection over union) index is calculated between the ground truth pose bounding box and the estimated one. If the IoU is above 90%, the estimation is considered positive. Starting with the first metric, FPR (1) stands for false-positive rate. It is a statistical metric that indicates the ratio of false-positive estimations to the total number of negative ground truth samples:

$$\begin{aligned} \textrm{FPR} = \frac{\mathrm{False\, Positive}}{\mathrm{False \,Positive} + \mathrm{True\, Negative}}. \end{aligned}$$
(1)

This metric (1) is commonly used in evaluating various classification applications. Next, sensitivity (2), or true-positive rate, is a metric that calculates the proportion of correctly detected positives among all positive ground truth samples:

$$\begin{aligned} \textrm{Sensitivity} = \frac{\mathrm{True\, Positive}}{\mathrm{True\, Positive} + \mathrm{False\, Negative}}. \end{aligned}$$
(2)

High sensitivity (2) means that the model correctly identifies the positively labeled data. Next, precision (3) calculates the proportion of true positives among all positive predictions:

$$\begin{aligned} \textrm{Precision} = \frac{\mathrm{True \,Positive}}{\mathrm{True\, Positive} + \mathrm{False\, Positive}}. \end{aligned}$$
(3)
Table 3 Performance metrics

High precision (3) means that most of the model's positive estimations match the ground truth labels across the whole dataset. With the above metrics, the performance of HPE models in detection is measured.
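As a concrete illustration, the following sketch (NumPy assumed, helper names hypothetical) derives a bounding box from pose keypoints and computes the IoU used for the 90% positive-match threshold:

```python
import numpy as np

def pose_to_bbox(joints):
    """Corner-format box (x1, y1, x2, y2) from a (J, 2) array of joints."""
    j = np.asarray(joints)
    return j[:, 0].min(), j[:, 1].min(), j[:, 0].max(), j[:, 1].max()

def iou(a, b):
    """Intersection-over-union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

pred = np.array([[110, 60], [150, 200], [130, 120]])   # toy predicted joints
gt = np.array([[100, 50], [160, 210], [128, 118]])     # toy ground truth joints
is_positive = iou(pose_to_bbox(pred), pose_to_bbox(gt)) > 0.9
```

Next, the pose estimation metrics are presented, starting from OKS (4), which stands for Object Keypoint Similarity: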

$$\begin{aligned} \textrm{OKS} = \frac{\sum _{i}\exp {\frac{-d_{i}^2}{2s^2k_{i}^2}}\delta (u_{i}>0)}{\sum _{i}\delta (u_{i}>0)}, \end{aligned}$$
(4)

where in (4) \(d_{i}\) is the Euclidean distance between the predicted joint and the ground truth joint. The term \(s\) is an image scale calculated from the image bounding box, and \(k_{i}\) is a per-joint constant that attempts to homogenize the standard deviation of each joint part. The OKS metric (4) takes values from zero to one and indicates how close the ground truth keypoints are to the predicted ones; the higher the value, the more similar the predicted and ground truth joints.
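Equation (4) maps directly to a few lines of NumPy; the sketch below assumes the common COCO convention where \(s^2\) is the person's bounding box area (an assumption, since the text only states that s is computed from the bounding box):

```python
import numpy as np

def oks(pred, gt, visibility, k, bbox_area):
    """Object Keypoint Similarity, following Eq. (4).

    pred, gt:   (J, 2) arrays of predicted/ground truth joints.
    visibility: (J,) array; a joint contributes only when > 0.
    k:          (J,) per-joint constants.
    bbox_area:  person bounding box area, used as the scale term s^2.
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)            # squared distances d_i^2
    mask = visibility > 0
    sims = np.exp(-d2[mask] / (2 * bbox_area * k[mask] ** 2))
    return sims.mean() if mask.any() else 0.0
```

Next, the PDJ metric (5), for the Percentage of Detected Joints proposed in [3], is presented: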

$$\begin{aligned} d_{j}^n = \frac{\left\| Y\textrm{pred}_{j}^n - Y\textrm{true}_{j}^n\right\| _{2}}{t^n} \end{aligned}$$
(5)

where in (5) the distance between the predicted joint \(Y\textrm{pred}_{j}^n\) and the ground truth joint \(Y\textrm{true}_{j}^n\) is calculated; n corresponds to the person index, j to the body joint, and \(t^n\) is the torso diameter. The conditions based on the predicted and ground truth joint distance are presented below:

$$\begin{aligned} \delta (j, n) = \left\{ \begin{array}{ll} 1, &{} \ d_{j}^n \le 1 \\ 0, &{} \text {otherwise} \end{array}\right. \end{aligned}$$
(6)
$$\begin{aligned} \textrm{PDJ}(j) = \sum \limits _{n=1}^{N}\frac{\delta {(j, n)}}{N} \end{aligned}$$
(7)

In (7), the mathematical expression of PDJ is presented, where N is the number of evaluated person instances. A disadvantage of this metric is that it is affected by the torso diameter, which might not be robust. Finally, PCK (9), which stands for Percentage of Correct Keypoints, is used. It is a metric proposed in [40] to overcome the above disadvantage of PDJ by replacing the torso diameter with the head segment length:

$$\begin{aligned} d_{j}^n = \frac{\left\| Y\textrm{pred}_{j}^n - Y\textrm{true}_{j}^n\right\| _{2}}{h^n} \end{aligned}$$
(8)
$$\begin{aligned} \textrm{PCK}(j) = \sum \limits _{n=1}^{N}\frac{\delta {(d_{j}^n < 0.5)}}{N} \end{aligned}$$
(9)

In (8), \(d_{j}^n\) is the joint distance normalized by the head segment length \(h^n\), n is the person index, and j is the body joint; in (9), N is the total number of evaluated person instances. The value 0.5 is a threshold on the joint distance relative to the head bone link. The above pose estimation metrics can be used in both 2D and 3D methods, with the difference that in 3D the threshold is replaced by a true distance value.
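Equations (8)–(9) translate to a short vectorized sketch (NumPy assumed; array shapes are illustrative):

```python
import numpy as np

def pck(pred, gt, head_lengths, threshold=0.5):
    """Percentage of Correct Keypoints per joint, following Eqs. (8)-(9).

    pred, gt:     (N, J, 2) arrays for N persons and J joints.
    head_lengths: (N,) head segment length per person (the h^n term).
    Returns a (J,) array with the PCK of each joint type.
    """
    # Normalized distances d_j^n, Eq. (8)
    d = np.linalg.norm(pred - gt, axis=2) / head_lengths[:, None]
    # Fraction of persons below the threshold, Eq. (9)
    return (d < threshold).mean(axis=0)
```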

Table 4 Efficiency metrics

In Table 4, frames per second (FPS) measures the number of images processed or displayed each second and indicates the processing speed. The GPU-CPU usage and memory are measured by the PyTorch [38] profilers specifically for deep learning operations. The GPU and CPU usage is calculated from the number of requests from the operating system to each unit, plus the request execution time, divided by the total process time. GPU watt consumption is measured during model inference for each image, and finally the mean value is produced. Besides device resource consumption, some model structure measurements are taken into consideration, such as MACs, which indicate the theoretical computational cost of each model.
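One way such per-inference power readings can be obtained, assuming the pynvml bindings to the NVIDIA Management Library (the exact tooling is not specified here), is:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU

def power_watts():
    """Instantaneous board power draw in watts (NVML reports milliwatts)."""
    return pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0

samples = []
for image in images:                 # hypothetical image iterator
    before = power_watts()
    _ = model(image)                 # hypothetical model from the benchmark
    samples.append(power_watts() - before)

mean_watts = sum(samples) / len(samples)   # the reported per-model figure
```

Next, the selected benchmark models are presented.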

3.3 Benchmark HPE models

For benchmark purposes, 36 different HPE models are adopted and their performance is analyzed to find the most suitable for UAV deployment. Acceleration techniques are very important in neural network inference, especially under UAV conditions. They include numerous methods [41] such as quantization, pruning, and computation graph optimizers like ONNX or TensorRT, which provide a tremendous boost in model inference speed. Unfortunately, most multi-person and some single-person model architectures do not fully support these techniques because they use customized attention mechanisms, neural network layer operations, or sub-networks, which complicates the usage of acceleration techniques; thus, most of them are not applied in this study. Single-person models usually have single-structure neural network architectures, which support acceleration techniques better. The benchmark models are used as published, and acceleration techniques are not taken into consideration by the proposed benchmark. All of the selected models use a variety of CNN backbones for feature extraction, where some are very large and some lightweight. This discrimination is made in order to highlight the importance of algorithmic efficiency and its trade-off with solution performance. In Tables 5 and 6, the selected models are presented along with their numbers of convolution, pooling, and batch normalization layers (gathered as shown in the sketch below). First, it is clear that pooling layers are not preferred for the HPE problem, which confirms that downscaling the feature map resolution leads to false localization of human joints. A second observation is that almost all models use many batch normalization layers, included in residual blocks along with ReLU between convolution layers. With this technique, the models that solve the HPE problem gain the ability to extract high-level features while preserving important information, which could be characterized as a replacement for the pooling operation.
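Such layer counts can be collected with a simple traversal of each network graph; a minimal PyTorch sketch (assuming each model loads as a `torch.nn.Module`) is:

```python
import torch.nn as nn

def count_layers(model: nn.Module):
    """Count convolution, pooling, and batch normalization layers."""
    counts = {"conv": 0, "pool": 0, "batchnorm": 0}
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            counts["conv"] += 1
        elif isinstance(m, (nn.MaxPool2d, nn.AvgPool2d)):
            counts["pool"] += 1
        elif isinstance(m, nn.BatchNorm2d):
            counts["batchnorm"] += 1
    return counts
```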

Table 5 Benchmark lightweight HPE models
Table 6 Benchmark ResNet-based HPE models

As presented in Tables 5 and 6, 22 models use ResNet as a backbone for feature extraction [3, 47, 49,50,51,52,53,54,55,56]. The ResNet architecture is characterized by feature map scaling from high to low resolution. In the HPE problem, this characteristic gives the capability to extract global and local features for human joints, which makes a model robust to scale and occlusion variations. Of the 36 models, 11 [16, 42,43,44,45,46,47] exploit lightweight backbones such as MobileNet, AlexNet, and ShuffleNet, which have small and simple CNN structures, making them suitable for balancing performance and efficiency. Finally, the YoloV5 model [48] is a direct regression approach that differs from the rest, which use sub-networks, attention mechanisms, and other approaches to estimate human body joints. Much more weight has been given to top-down approaches, which account for 28 [3, 16, 42,43,44,45,46,47, 49,50,51,52, 54,55,56] of the 36 models, because their performance and efficiency are not constrained by the number of humans in the input image. Depending on the application, top-down methods can be used for both single- and multiple-person pose estimation.

4 Benchmark results and discussion

In this section, the benchmark results are presented, starting with the performance metrics in problem-solving and concluding with the efficiency measurements. The top three models with the highest scores are marked in the performance results.

Table 7 MS-COCO dataset performance benchmark results

In a general view, all models achieve nearly 80% accuracy in all three pose estimation metrics. Considering the object detection metrics, almost all estimated poses are not far in pixel distance from the ground truth joints. The three best-performing models for MS-COCO [8] are HigherNet-50 [53] in first place with 83%, 88%, and 90% in the three pose estimation metrics; MobileNetV1-AE [45] in second place, at a 5–10% difference, with 77%, 82%, and 85% for the same metrics; and ResNet-101-AE [54] in third place, at a 1–4% difference, with 76%, 79%, and 81%. Another observation is that the lightweight models have performance similar to the ResNet-based models. In Table 8, the performance metrics for the OCHuman dataset are presented.

Table 8 OCHuman dataset performance benchmark results

In general, all models perform poorly on the OCHuman [9] dataset; the cause is the high level of occlusion between humans, with their clothing variations on top of that. In addition, all of the above models are trained on MS-COCO [8], which does not share the same level of difficulty. Most of the ResNet-based models perform better than the lightweight ones, with a 1–5% difference in all three pose estimation metrics. The top three models on this dataset are DiteHrnet-18 [51] in first place with 64%, 69%, and 71% in the pose estimation metrics; VGG19-PAF [16] in second place with 64%, 68%, and 69%, with ShuffleNetV1-PAF [16] close behind; and 2xMSPN-50 [49] in third place with 64%, 67%, and 69% pose estimation accuracy. In Table 9, the performance metrics for the UAV-Human dataset are presented.

Table 9 UAV-Human dataset performance benchmark results

In Table 9, all models achieve higher performance scores on the UAV-Human dataset [10] compared to the previous datasets, because each image has only a single pose estimation target, while the other two have multiple. Again, the ResNet-based models have the highest performance, indicating that they can extract more robust features even in small image scale conditions. First place goes to ResNet-50-DCN [55], which achieved 94%, 95%, and 99% in the three pose estimation metrics. In second place is ResNet-50-AE [54] with 89%, 91%, and 95%. In third place is ResNet-101-AE [54] with 91%, 92%, and 96%. Despite the ResNet-based lead, the lightweight models are very close in performance even with their smaller structures, with MobileNetV2-AE [45] and HRNet-L [43] within 1–2% of the top three models in the PCK metric and 6–10% in the other two pose estimation metrics.

In all the above tables, the ResNet-based models hold the first places because, based on the performance results, they handle the datasets' difficulties better. However, it is worth mentioning that the lightweight models are very close. The chosen datasets encompass a comprehensive range of challenging detection conditions, including scale variations and occlusions when viewed from an aerial camera perspective. In particular, the MS-COCO and OCHuman datasets comprise images depicting diverse human scales alongside significant occlusions. Moreover, the UAV-Human dataset incorporates variations in scale, specifically featuring images with small-scale human subjects. Consequently, evaluating the models' performance on each dataset can effectively showcase their robustness under specific conditions. It is worth noting that conventional neural network operations do not inherently offer scale, rotation, or occlusion invariance; rather, these characteristics are acquired through the training data, which enables the models to exhibit robustness against such degradation factors. Considering the performance metrics, it becomes evident that occlusions pose a more significant challenge than scale variations or the specific perspective of UAV camera imagery. CNNs enhanced with mechanisms such as AE (associative embedding) or PRM have emerged as the top-performing models across all three datasets, clearly demonstrating their resilience against the degradation factors present in these datasets. By leveraging AE, CNNs exhibit improved grouping of individual person joints, leading to enhanced matching of identical joints and better differentiation among multiple persons' joints. As a result, this mechanism effectively provides occlusion invariance, addressing the challenges associated with occlusions in the HPE problem. By employing PRM, a fusion of local and global features is achieved through a reweighting process applied to the generated feature map. This results in a more balanced feature map, which significantly improves pose estimation accuracy. Notably, this attention mechanism plays a crucial role in enhancing the model's robustness against both scale variations and occlusions. The incorporation of custom neural network layers and attention mechanisms, as evident from the performance results, effectively supports the models' resilience in overcoming the aforementioned challenges. Concluding the performance benchmark presentation, it remains unclear which model is best suited for deployment on edge devices. To further explore this aspect, measurements of model efficiency are provided and analyzed. Within each dataset, the top three models are identified based on the highest FPS and the lowest resource consumption. For the experiments, an Nvidia RTX 2060 GPU with 8 GB VRAM and an AMD Threadripper 2920X 12-core CPU were utilized.

Table 10 MS-COCO dataset efficiency benchmark results of lightweight models
Table 11 MS-COCO dataset efficiency benchmark results of ResNet-based models

Considering both Tables 10 and 11, it is clear that the lightweight models take first place because they maintain low GPU usage and memory while achieving high FPS. A balanced model between the ResNet-based and lightweight groups is VGG16-PAF, which achieves high FPS at a friendlier computational cost. Based on both tables, the top three models are in Table 10; they have the highest speed and the lowest hardware consumption compared with the rest of the models. It is clear that the ResNet-based models are computationally heavy, leaving no room for additional models to be deployed to extract higher-level information about the human targets in the image content. Including the performance metrics verifies that it is very challenging for a model to optimally balance performance and efficiency. Compared with the performance metrics, the efficiency results show a larger gap between the lightweight models and the rest. This is reflected in FPS and MACs, where most lightweight models differ from the ResNet-based models by more than 9 FPS and 1.0 GMAC. Beyond these general observations, the most efficient model for this dataset is AlexNet-CPM [42], with very low GPU usage and memory and high FPS. The same applies to ShuffleNetV1-PAF [46] and VIPNAS with a MobileNetV3 backbone [47]. Very close to these models is MobileNetV1-AE [45].

Table 12 OCHuman dataset efficiency benchmark results

Analyzing the results in Table 12, the FPS dropped significantly in comparison with the previous dataset. This is because the OCHuman dataset [9] has a higher image resolution (1280x720) than MS-COCO [8], which increases GPU and CPU resource consumption. From the models' perspective, the ResNet-based models once again require a huge amount of resources in comparison with the lightweight models, which require less than 100 MB of GPU memory. The lightweight models hold first place, with AlexNet-CPM [42] achieving almost 20 FPS with very low GPU usage and memory. In second place is MobileNetV3-PAF [44], which has a higher FPS than the previous model but a higher computational cost. Finally, VIPNAS [47] with MobileNetV3 as a backbone is in third place.

Table 13 UAV dataset efficiency benchmark results

In Table 13, the efficiency measurements for the UAV-Human dataset [10] are similar to those for OCHuman [9] because the two share the same image resolution. As before, the lightweight models are the friendliest to the device's computational resources while having the highest speed. Once again, first place goes to AlexNet-CPM [42] with high speed and low device resource consumption, while second and third place go to MobileNetV3-PAF [44] and VIPNAS [47] with MobileNetV3 as a backbone. To close the efficiency analysis, the GPU watt consumption of the HPE models is presented in Fig. 6. These watt measurements are taken as the difference in GPU watt consumption before and after model inference for each image; finally, the mean over all measurements is calculated.

Fig. 6 Benchmark models watt consumption

From Fig. 6, it is clear that the most efficient models also have the lowest watt consumption, which characterizes them as computationally friendly and lightweight. Next, visual results of the best-performing models are presented, with an image sample for each dataset.

In Fig. 7, the ground truth for the COCO dataset is presented, where the human pose has a very small scale in a complex environment relative to the rest of the image. In addition, some human body parts, such as the left leg, are not visible. From the results, it is clear that scale affects the performance of the models, since none of the predictions is as tidy as the ground truth. In general, the results are of similar quality for both lightweight and ResNet-based models. This similarity may arise from the fact that almost all 12 models share common modifications such as AE or DCN, attention-style mechanisms that have boosted HPE estimations, as mentioned and analyzed in the previous section. In Fig. 8, the ground truth and the estimations of the 12 best-performing models for the OCHuman dataset are presented. In this test image, the occlusions between humans are very high, and the body parts have very complex positions and orientations. Based on the models' estimations, the ResNet-based models perform better since they have a larger feature extractor than the lightweight models. All ResNet-based models predict both football players from the ground truth, while the lightweight models predicted only the front one. In a more general view, all models show unsatisfactory localization of the human body points in comparison with the ground truth.

As before, the UAV-Human dataset results are presented in Fig. 9. In this dataset, the target is to identify the human's action through the human pose in a video sequence. In the specific sample, the human pose points are very close to each other and at a far distance. Based on the models' results, it is clear that all models perform similarly despite their differences in architecture. Considering the rest of the visual results, it seems that even a small model architecture can achieve robust results in normal cases of scale and occlusion, given the appropriate modifications and additions (attention mechanisms, custom layers). Taking into account the performance metric results presented previously, MobileNetV3-PAF [44] is the most balanced between performance and efficiency, since it brings performance metrics very close to the top ones. Excluding the MS-COCO [8] and OCHuman [9] datasets from the scope of analysis and focusing on the UAV-Human dataset [10], more models could be characterized as candidates for an equal trade-off between performance and efficiency, such as MobileNetV2-AE [45], HRNet-L [43], and VIPNAS [47], which also places among the first in the efficiency metrics. Considering both the performance and efficiency results, the most suitable models for UAV deployment are MobileNetV2-AE [45], HRNet-L [43], VIPNAS [47], and MobileNetV3-PAF [44]. Closing this section, the most balanced model and the other three mentioned above are selected for the analysis presented in the next section.

Fig. 7 Visual results of the 12 best-performing models on the COCO dataset

Fig. 8 Visual results of the 12 best-performing models on the OCHuman dataset

Fig. 9 Visual results of the 12 best-performing models on the UAV-Human dataset

5 Benchmark results projection in edge device specifications

In this section, the efficiency measurements are projected onto edge device hardware specifications, because the presented benchmark was conducted on a desktop machine with specifications quite far from those of an edge device. The selected edge devices are the Jetson TX2 NX and Xavier NX, because they are the only ARM-based devices that combine a strong GPU with low energy consumption. Between these two edge devices, the Xavier has hardware specifications near those of a common computer, while the TX2 has more limited computational resources. In Table 14, the hardware specifications of the benchmark machine and the above-mentioned boards are presented.

Table 14 Benchmark machine and edge device hardware specifications

Comparing the above devices, the desktop GPU can achieve 2 times the TFLOPs of the Jetson Xavier and 30 times those of the Jetson TX2. Between the Jetson devices, the Xavier can achieve 15 times the TFLOPs of the TX2. From the GPU perspective, the desktop machine has 7 times more CUDA cores than the Xavier and 8.5 times more than the TX2. Between the Jetson devices, the Xavier has 1.5 times more CUDA cores than the TX2.

In order to estimate the selected models' efficiency, two known metrics are used: code balance and machine balance [57], together referred to as balance analysis. In addition, the performance ratio is presented, which indicates the software's efficiency. In balance analysis, the included metrics measure the software code's performance under a specific device's computational capacities based on its resources.

$$\begin{aligned} B_{m} = \frac{b_\textrm{max}}{P_\textrm{max}} \end{aligned}$$
(10)
$$\begin{aligned} b_\textrm{max} = \frac{\textrm{MemoryBandwidth}}{\mathrm{bytes/word}} \end{aligned}$$
(11)

In (10), the term \(b_\textrm{max}\) is the device memory bandwidth divided by the bytes per word that the device can achieve, as presented in (11). The term \(P_\textrm{max}\) is the peak floating-point performance of the device. The unit of measurement for \(B_{m}\) is words/flop. More specifically, \(B_{m}\) is the ratio of memory operations per CPU or GPU cycle to the number of floating-point operations in the same processor. Accordingly, code balance is the ratio between the memory traffic of a software code and the flops it produces. The mathematical formula of code balance is presented below.

$$\begin{aligned} B_{c} = \frac{\textrm{DataTraffic}}{\textrm{codeFLOPs}} \end{aligned}$$
(12)

In (12), the DataTraffic term can be estimated as the sum of data loads and data stores in variables. For CNNs, this is the sum of the loaded input data and the stored inference data. Finally, this result is divided by the model MACs in order to compute the CNN model's code balance \(B_{c}\). Between the two metrics (10) and (12) holds the relationship that if \(B_{m}\) is smaller than \(B_{c}\), then the software code is not efficient for the candidate device. It is worth mentioning that these metrics are not free of systematic bias, because they do not include cache memory reads or latencies from memory operations. The performance ratio is an indicator calculated by dividing the memory bandwidth by the code balance, where the result is compared with the peak performance of the device.

$$\begin{aligned} P = \min {\left( P_\textrm{max}, \frac{b_\textrm{max}}{B_{c}}\right) } \end{aligned}$$
(13)

From (13), the result indicates the maximum achievable performance. The fraction of \(b_\textrm{max}\) and \(B_{c}\) estimates the required performance in flops. If the result equals \(P_\textrm{max}\), then the code saturates the whole device's performance, which is not efficient because it requires full power consumption. In addition, the fraction \(\frac{b_\textrm{max}}{B_{c}}\) indicates the fraction of the device's performance that is required.
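For clarity, the balance analysis of Eqs. (10)–(13) reduces to a few lines of arithmetic; the numbers below are illustrative placeholders, not the measured values of Table 15:

```python
def machine_balance(mem_bandwidth_gbs, bytes_per_word, peak_gflops):
    """B_m = b_max / P_max, Eqs. (10)-(11); unit: words/flop."""
    b_max = mem_bandwidth_gbs / bytes_per_word   # Gwords/s
    return b_max / peak_gflops, b_max

def code_balance(data_traffic_gwords, model_gmacs):
    """B_c = DataTraffic / codeFLOPs, Eq. (12)."""
    return data_traffic_gwords / model_gmacs

def attainable_performance(peak_gflops, b_max, b_c):
    """P = min(P_max, b_max / B_c), Eq. (13)."""
    return min(peak_gflops, b_max / b_c)

# Illustrative values only, roughly TX2-class hardware
b_m, b_max = machine_balance(mem_bandwidth_gbs=59.7, bytes_per_word=4,
                             peak_gflops=1330.0)
b_c = code_balance(data_traffic_gwords=0.5, model_gmacs=4.0)
p = attainable_performance(1330.0, b_max, b_c)
print(f"B_m={b_m:.4f} words/flop, B_c={b_c:.3f}, attainable P={p:.1f} GFLOPs")
# The code fits the device when B_c <= B_m; p / peak gives the performance ratio
```

In Table 15, the above metrics for the presented devices and models are given.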

Table 15 \(B_{c}\) values of the selected models

Analyzing the values in Tables 14 and 15, it is clear that all models can be executed sufficiently well on the above edge devices, since their \(B_{c}\) is smaller than the hardware machine balance \(B_{m}\). In addition, the models' memory consumption fits comfortably on both Jetson devices. The performance ratio of the models' achievable performance is 10–20% of the peak performance for the Jetson TX2 and, accordingly, 2–6% for the Jetson Xavier. More specifically, MobileNetV3-PAF [44] and HRNet-L [43] have the higher performance ratios since they have the lowest \(B_{c}\) compared to the other models. From the above analysis, it is clear that, at a theoretical level, all indicators show that the above 4 models are sufficient for the selected edge devices with respect to efficiency and solution performance. Based on \(B_{c}\) and the performance ratio, MobileNetV3-PAF [44] is the best candidate for both edge devices. Considering the benchmark results as well, the MobileNetV3-PAF [44] model fulfills the paper's aspects.

6 Conclusion

The proposed paper analyzes and reviews HPE methods and presents a benchmark with multiple aspects. The review showed that proposed 2D HPE methods gained significant momentum on the problem-solving challenges twice: once with the proposition of CNNs, and again when 3D HPE methods gained interest from the CV community. Unfortunately, from the aspect of efficiency and UAV deployment, there are few robust proposed methods for 2D and even fewer for 3D data. Based on the benchmark results and the edge device projection, lightweight models with a MobileNet backbone have the most optimal balance between performance and efficiency, with MobileNetV3-PAF [44] the most suitable among them.

Despite the results, it has proven difficult to design an efficient HPE method that performs robustly on UAVs, even with CNNs. Of the 36 models, only 4 were candidates, and finally only 1 is appropriate for achieving high performance with low resource consumption without stressing the edge device, based on the projection. More broadly, very few models deal with active-vision HPE under edge device deployment on UAVs. This fact can be attributed to the mainstream benchmark datasets used for development, which include non-sequential images with random conditions and close distances. Another contributing factor is the usage of very large CNN backbones for feature extraction, which is computationally intense. A future direction may include the development of a lightweight method that exploits human pose motion information in order to predict and detect human poses in the next frame, rather than applying the same method to each frame. This might boost human detection and estimation in sequential images and enable the next generation of human trackers. Another future direction of the presented study is a benchmark of 3D HPE models under the aspects of performance and efficiency, in order to support the conclusions with numerical results.