Efficient Annotation and Learning for 3D Hand Pose Estimation: A Survey

In this survey, we present a systematic review of 3D hand pose estimation from the perspective of efficient annotation and learning. 3D hand pose estimation has been an important research area owing to its potential to enable various applications, such as video understanding, AR/VR, and robotics. However, the performance of models is tied to the quality and quantity of annotated 3D hand poses. Under the status quo, acquiring such annotated 3D hand poses is challenging, e.g., due to the difficulty of 3D annotation and the presence of occlusion. To reveal this problem, we review the pros and cons of existing annotation methods classified as manual, synthetic-model-based, hand-sensor-based, and computational approaches. Additionally, we examine methods for learning 3D hand poses when annotated data are scarce, including self-supervised pretraining, semi-supervised learning, and domain adaptation. Based on the study of efficient annotation and learning, we further discuss limitations and possible future directions in this field.


Introduction
The acquisition of 3D hand pose annotations has presented a significant challenge in the study of 3D hand pose estimation. This makes it difficult to construct large training datasets and develop models for various target applications, such as hand-object interaction analysis [5,25], pose-based action recognition [38,79,91], augmented and virtual reality [28,46,98], and robot learning from human demonstration [14,29,51,69]. In these application scenarios, we must consider methods for annotating hand data and obtaining positional labels. Three obstacles arise. As for the first obstacle, the setups that provide such labels require a controlled environment, which limits available scenarios. As for the second obstacle, occlusion hinders annotators from accurately localizing the positions of hand joints. As for the third obstacle, annotated data are biased to a specific condition constrained by the annotation method. For instance, annotation methods based on hand markers or multi-view setups are usually installed in laboratory settings, resulting in a bias toward a limited variety of backgrounds and interacting objects.
Given such challenges in annotation, we conduct a systematic review of the literature on 3D hand pose estimation from two distinct perspectives: efficient annotation and efficient learning (see Fig. 1). The former view highlights how existing methods assign reasonable annotations in a cost-effective way, covering a range of topics: the availability and quality of annotations and the limitations when deploying the annotation methods. The latter view focuses on how models can be developed in scenarios where annotation setups cannot be implemented or available annotations are insufficient.
In contrast to existing surveys on network architecture and modeling [10,17,44,45,48], our survey delves into another fundamental direction that arises from the annotation issues, namely, dataset construction with cost-effective annotation and model development with limited resources. In particular, our survey includes benchmarks, datasets, image capture setups, automatic annotation, learning with limited labels, and transfer learning. Finally, we discuss potential future directions of this field beyond the current state of the art.
For the study of annotation, we categorize existing methods into manual [9,58,86], synthetic-model-based [11,31,56,58,109], hand-marker-based [21,88,105], and computational approaches [25,42,43,55,81,110]. While manual annotation requires querying human annotators, hand markers automate the annotation process by tracking sensors attached to a hand. Synthetic methods utilize computer graphics engines to render plausible hand images with precise keypoint coordinates. Computational methods assign labels by fitting a hand template model to the observed data or using multi-view geometry. We find these annotation methods have their own constraints, such as the necessity of human effort, the sim-to-real gap, the changes in hand appearance, and the limited portability of the camera setups. Thus, these annotation methods may not always be adopted for every application.
Fig. 1 Our survey on 3D hand pose estimation is organized from two aspects: (i) obtaining 3D hand pose annotation and (ii) learning even with a limited amount of annotated data. These two issues will be considered in the scenarios of practical applications where we work on dataset construction and model development with limited resources. The figure is adapted from [109].
Due to the problems and constraints of each annotation method, we need to consider how to develop models even when we do not have enough annotations. Therefore, learning with a small amount of labels is another important topic. For learning from limited annotated data, leveraging a large pool of unlabeled hand images as well as labeled images is a primary interest, e.g., in self-supervised pretraining, semi-supervised learning, and domain adaptation. Self-supervised pretraining encourages the hand pose estimator to learn from unlabeled hand images, so it enables building a strong feature extractor before performing supervised learning. While semi-supervised learning trains the estimator with labeled and unlabeled hand images collected from the same environment, domain adaptation further solves the so-called problem of domain gap between the two image sets, e.g., the difference between synthetic data and real data.
The rest of this survey is organized as follows. In Section 2, we introduce the formulation and modeling of 3D hand pose estimation. In Section 3, we present open challenges in the construction of hand pose datasets involving depth measurement, occlusion, and dataset bias. In Section 4, we cover existing methods of 3D hand pose annotation, namely manual, synthetic-model-based, hand-marker-based, and computational approaches. In Section 5, we provide learning methods from a limited amount of annotated data, namely self-supervised pretraining, semi-supervised learning, and domain adaptation. In Section 6, we finally show promising future directions of 3D hand pose estimation.
Fig. 2 Formulation and modeling of single-view 3D hand pose estimation. For input, we use either RGB or depth images cropped to the hand region. The model learns to produce a 3D hand pose defined by 3D coordinates. Some works additionally estimate hand shape using a 3D hand template model. For modeling, there are three major designs: (A) 2D heatmap regression and depth regression, (B) extended three-dimensional heatmap regression called 2.5D heatmaps, and (C) direct regression of 3D coordinates.

Overview of 3D Hand Pose Estimation
Task setting. As shown in Fig. 2, 3D hand pose estimation is typically formulated as the estimation from a monocular RGB/depth image [18,37,103]. The output is parameterized by the hand joint positions with 14, 16, or 21 keypoints, which are introduced in [92], [89], and [68], respectively. The dense representation of 21 hand joints (the five end keypoints are fingertips and are not strictly joints) has been popularly used as it contains more precise information about hand structure. For a single RGB image in which depth and scale are ambiguous, the 3D coordinates of the hand joints relative to the hand root are estimated from a scale-normalized hand image [6,22,109]. Recent works additionally estimate hand shape by regressing 3D hand pose and shape parameters together [5,22,57,107]. In evaluation, the prediction is compared with the ground truth, e.g., in the space of world or image coordinates. Two metrics are often used: mean per joint position error (MPJPE) in millimeters, and the area under the curve of the percentage of correct keypoints (PCK-AUC).
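The two metrics above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the evaluation code of any particular benchmark: the function names, the choice of joint index 0 as the hand root, and the 0 to 50 mm threshold range for the PCK curve are all assumptions made here for the example.

```python
import numpy as np


def root_relative(joints, root_index=0):
    """Express (J, 3) joints relative to the root joint (index 0 is an assumption)."""
    return joints - joints[root_index]


def mpjpe(pred, gt):
    """Mean per joint position error: average Euclidean distance over joints."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def pck_auc(pred, gt, max_threshold=50.0, num_steps=101):
    """Area under the PCK curve, normalized to [0, 1].

    PCK at threshold t is the fraction of joints whose error is at most t.
    The threshold range (0 to max_threshold, in the units of the inputs)
    is an assumption of this sketch.
    """
    errors = np.linalg.norm(pred - gt, axis=-1).ravel()
    thresholds = np.linspace(0.0, max_threshold, num_steps)
    pck = np.array([(errors <= t).mean() for t in thresholds])
    # Trapezoidal integration of the PCK curve over the threshold range.
    area = np.sum((pck[1:] + pck[:-1]) / 2.0 * np.diff(thresholds))
    return float(area / max_threshold)
```

For RGB-based estimation, both `pred` and `gt` would typically be passed through `root_relative` first, so that the error measures the pose rather than the absolute hand position.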
Modeling. Classic methods estimate a hand pose by finding the closest sample from a large set of hand poses, e.g., synthetic hand pose sets. Some works formulate the task as nearest neighbor search [73,75] while others solve pose classification given predefined hand pose classes and an SVM classifier [72,74,87].
Recent studies have adopted end-to-end training, where models learn the correspondence between the input image and its 3D hand pose label. Standard single-view methods from an RGB image [6,22,109] consist of (A) the estimation of 2D hand poses by heatmap regression and depth regression for each 2D keypoint (see Fig. 2). The 2D keypoints are learned by optimizing heatmaps centered on each 2D hand joint position. An additional regression network predicts the depth of the detected 2D hand keypoints. Other works use (B) extended 2.5D heatmap regression with a depth-wise heatmap in addition to the 2D heatmaps [39,55], so they do not require a depth regression branch. Depth-based hand pose estimation also utilizes such heatmap regression [36,71,100]. Instead of heatmap training, other methods learn to (C) directly regress keypoint coordinates [77,85]. For the architecture of the backbone network, CNNs (e.g., ResNet [33]) are a basic choice, while Transformer-based methods have recently been proposed [26,35]. To generate feasible hand poses, regularization is a key trick in correcting predicted 3D hand poses. Based on the anatomical study of hands, bio-mechanical constraints are imposed to limit predicted bone lengths and joint angles [13,47,83].
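To make the heatmap representation in design (A) concrete, the following sketch renders a Gaussian target heatmap around a 2D joint position and decodes keypoints back from a stack of heatmaps by taking the peak. The heatmap size, the value of `sigma`, and the use of a hard argmax for decoding are illustrative assumptions; individual methods differ in these choices.

```python
import numpy as np


def gaussian_heatmap(center_xy, size=64, sigma=2.0):
    """Render a (size, size) heatmap with a Gaussian peak at center_xy = (x, y)."""
    xs = np.arange(size)[None, :]  # column (x) coordinates
    ys = np.arange(size)[:, None]  # row (y) coordinates
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def decode_heatmaps(heatmaps):
    """Return (J, 2) pixel coordinates (x, y) of the peak of each (J, H, W) heatmap."""
    num_joints, height, width = heatmaps.shape
    flat_index = heatmaps.reshape(num_joints, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_index, (height, width))
    return np.stack([xs, ys], axis=1)
```

In design (A), the decoded 2D positions would then be paired with a separately regressed depth per keypoint, whereas design (B) reads the depth from an additional depth-wise heatmap.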

Challenges in Dataset Construction
Task formulation and algorithms for estimating 3D hand poses are outlined in Section 2. Training these models requires a large amount of data with diverse hand poses, viewpoints, and backgrounds. However, obtaining such massive hand data with accurate annotations has been challenging for the following reasons.
Difficulty of 3D annotation. Annotating the 3D position of hand joints from a single RGB image is inherently impossible without any prior information or additional sensors because the problem is ill-posed. To assign accurate hand pose labels, hand-marker-based annotation using magnetic sensors [21,97,105], motion capture systems [54,78,88], or hand gloves [4,23,96] has been studied. These sensors can provide 6-DoF information (i.e., location and orientation) of attached markers and enable us to calculate the coordinates of full hand joints from the tracked markers. However, their setups are expensive and need good calibration, which constrains available scenarios.
In contrast, depth sensors (e.g., RealSense) or multi-view camera studios [9,25,55,81,110] make it possible to obtain depth information near hand regions. Given 2D keypoints for an image, these setups enable annotation of 3D hand poses by measuring the depth distance at each 2D keypoint. However, these annotation methods do not always produce satisfactory 3D annotations, e.g., due to an occlusion problem (detailed below). In addition, depth images are significantly affected by sensor noise, such as unknown depth values in some regions and ghost shadows around object boundaries [101]. Due to the limited range that depth cameras can capture, the depth measurement becomes inaccurate when the hands are far from the sensor.
Fig. 4 Example of major data collection setups. The synthetic image on the left (ObMan [31]) can be generated inexpensively, but it exhibits unrealistic hand texture. The hand markers in the middle (FPHA [21]) enable automatic tracking of hand joints, although the markers distort the appearance of hands. The in-lab setup on the right (DexYCB [9]) uses a black background to make it easier to recognize hands and objects, but it limits data variation in environments.
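Measuring the depth at each 2D keypoint and lifting it to a 3D point can be sketched with a pinhole camera model. The intrinsics `fx`, `fy`, `cx`, `cy`, the convention that a depth value of 0 marks a missing measurement, and the function name are assumptions of this example; the sketch also shows how missing depth values propagate into unusable labels.

```python
import numpy as np


def backproject(keypoints_2d, depth_map, fx, fy, cx, cy):
    """Lift (J, 2) pixel keypoints (x, y) to (J, 3) camera-space points.

    depth_map is an (H, W) array in the desired metric unit. Pixels with
    depth <= 0 are treated as missing (an assumption of this sketch) and
    yield NaN rows in the output.
    """
    u = keypoints_2d[:, 0]
    v = keypoints_2d[:, 1]
    cols = np.rint(u).astype(int)
    rows = np.rint(v).astype(int)
    z = depth_map[rows, cols].astype(float)
    z[z <= 0] = np.nan  # sensor reported no valid depth here
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

Note that the value read from the depth map is the distance to the visible surface at that pixel, so an occluded joint would receive the depth of the occluder rather than its own.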
Occlusion. Hand images often contain complex occlusions that hinder human annotators from localizing hand keypoints. Examples of possible occlusions are shown in Fig. 3. In figure (a), articulation causes a self-occlusion that makes some hand joints (e.g., fingertips) invisible due to the overlap with the other parts of the hand. In figure (b), such self-occlusion depends on a specific camera viewpoint. In figure (c), hand-held objects induce occlusion that hides the hand joints by the object during the interaction.
To address this issue, hand-marker-based tracking [21,88,97,105] and multi-view camera studios [9,25,55,81,110] have been studied. The hand markers offer 6-DoF information during these occlusions, so the hand-marker-based annotation is robust to the occlusion. For multi-camera settings, the effect of occlusion can be reduced when many cameras are densely arranged.
Dataset bias. While hands are a common entity in various image capture settings, the category of objects, including hand-held objects (i.e., foregrounds) and backgrounds, is potentially diverse. In order to improve the generalization ability of hand pose estimators, hand images must be annotated under various imaging conditions (e.g., lighting, viewpoints, hand poses, and backgrounds). However, it is challenging to create such large and diverse datasets due to the aforementioned problems. Rather, existing hand pose datasets exhibit a bias to a particular imaging condition constrained by the annotation method.
As shown in Fig. 4, generating data using synthetic models [11,31,56,58,109] is cost-effective, but it creates unrealistic hand texture [63]. Although the hand-marker-based annotation [21,88,97,105] can automatically track the hand joints from the information of hand sensors, the sensors distort the hand appearance and hinder the natural hand movement. In-lab data acquired by multi-camera setups [9,25,55,81,110] make the annotation easier because they can reduce the occlusion effect. However, the variations in environments (e.g., backgrounds and interacting objects) are limited because the setups are not easily portable.

Annotation Methods
Given the above challenges concerning the construction of hand pose datasets, we review existing 3D hand pose datasets in terms of annotation design. As shown in Table 1, we categorize the annotation methods as manual, synthetic-model-based, hand-marker-based, and computational approaches. We then study the pros and cons of each annotation method in Table 2.

Manual annotation
MSRA [68], Dexter+Object [86], and EgoDexter [58] manually annotate 2D hand keypoints on the depth images and determine the depth distance from the depth value of the images at the 2D point. This method enables assigning reasonable annotations of 3D coordinates (i.e., 2D position and depth) when hand joints are fully visible.
However, it does not scale to a large number of frames due to the high annotation cost. In addition, since it is not robust for occluded keypoints, this approach only allows fingertip annotation, instead of full hand joints. Because of these limitations, these datasets provide a small amount of data (≈ 3K images) used for evaluation only. Additionally, these single-view datasets can produce view-dependent annotation errors because a single depth camera captures the distance to the hand skin surface, not the true joint position.

Synthetic-model-based annotation
To acquire large-scale hand images and labels, synthetic methods based on synthetic hand and full-body models [49,74,76,94] have been proposed. SynthHands [58] and RHD [109] render synthetic hand images with randomized real backgrounds from either a first- or third-person view. MVHM [11] generates multi-view synthetic hand data rendered from eight viewpoints. These datasets have succeeded in providing accurate hand keypoint labels on a large scale. Although they can generate various background patterns inexpensively, the lighting and texture of hands are not well simulated, and the simulation of hand-object interaction is not considered in the data generation process.
To handle these issues, GANerated [56] utilizes GAN-based image translation to stylize synthetic hands more realistically. Furthermore, ObMan [31] simulates the hand-object interaction in data generation using a hand grasp simulator (Graspit [53]) with known 3D object models (ShapeNet [8]). Ohkawa et al. proposed foreground-aware image stylization to convert the simulation texture in the ObMan data to a more realistic one while separating the hand regions and backgrounds [63]. Corona et al. attempted to synthesize more natural hand grasps with affordance classification and the refinement of fingertip locations [15]. However, the ObMan data only provide static hand images with hand-held objects, not hand motion.

Hand-marker-based annotation
As shown in Fig. 5, hand-marker-based annotation automatically tracks attached hand markers and then calculates the coordinates of hand joints. Initially, Wetzler et al. attached magnetic sensors to fingertips that provide 6-DoF information of the markers [97]. While this scheme can annotate fingertips only, recent datasets, BigHand2.2M [105] and FPHA [21], use these sensors to offer the annotation of the full 21 hand joints. Fig. 6 shows how to compute the joint positions given six magnetic sensors. It uses inverse kinematics to infer all 21 hand joints: a hand skeleton is fitted under the constraints of the marker positions and user-specific bone lengths measured manually beforehand.
However, these sensors obstruct natural hand movement and distort the appearance of the hand. Due to the changes in hand appearance, these datasets have been proposed for the benchmark of depth-based estimation, not the RGB-based task. In contrast, GRAB [88] is built with a motion capture system for human hands and body, but it does not possess visual modality, e.g., RGB images.
Fig. 7 Illustration of a multi-camera setup [110].

Computational annotation
Computational annotation is categorized into two major approaches: hand model fitting and triangulation. Unlike hand-marker-based annotation, these methods can capture natural hand motion without attaching hand markers.

Model fitting (depth).
Early works of computational annotation utilize model fitting on depth images [37,103]. Since a depth image provides 3D structural information, these works fit a 3D hand model, from which joint positions can be obtained, to the depth image. ICVL [89] fits a convex rigid body model by solving a linear complementarity problem with physical constraints [52]. NYU [92] uses a hand model defined by spheres and cylinders and formulates the model fitting as a kind of particle swarm optimization [64,65]. The use of other cues for the model fitting is also studied [2,50], such as edges, optical flow, shading, and collisions. Sharp et al. paint hands to obtain hand part labels by color segmentation on RGB images, and the proxy cue of hand parts further helps the depth-based model fitting [80].
Using these depth datasets, several more accurate labeling methods have been proposed. Rogez et al. gave manual annotation to a few joints and searched the closest 3D pose from a pool of synthetic hand pose data [74]. Oberweger et al. considered model fitting with temporal coherence [59]. This method selects reference frames from a depth video and asks annotators for manual labeling. Model fitting is done separately for annotated reference frames and unlabeled non-reference frames. Finally, all sequential poses are optimized to satisfy temporal smoothness.
Fig. 8 Illustration of a many-camera setup [99]. This setup has about 100 synchronized cameras and is used to create the InterHand2.6M dataset [55].

Triangulation (RGB).
For the annotation of RGB images, a multi-camera studio is often used to compute 3D points by multi-view geometry, i.e., triangulation (see Fig. 7). Panoptic Studio [81] and InterHand2.6M [55] triangulate a 3D hand pose from multiple 2D hand keypoints provided by an open source library, OpenPose [34], or human annotators. The generated 3D hand pose is reprojected onto the image planes of other cameras to annotate hand images with novel viewpoints. This multi-view annotation scheme is beneficial when many cameras are installed (see Fig. 8). For instance, InterHand2.6M manually annotates keypoints from 6 views and reprojects the triangulated points to the many other views (100+). This setup can produce over 100 training images for every single annotation. As a result, InterHand2.6M provides million-scale training data.
This point-level triangulation method works quite well when many cameras (30+) are arranged [55,81]. However, the AssemblyHands setup [61] has only eight static cameras, so the predicted 2D keypoints to be triangulated tend to be suboptimal due to hand-object occlusion during the assembly task. To improve the accuracy of triangulation in such sparse camera settings, Ohkawa et al. adopt multi-view aggregation of encoded features by the 2D keypoint detector and compute 3D coordinates from constructed 3D volumetric features [3,40,61,110]. This feature-level triangulation provides better accuracy than the point-level method, achieving an average keypoint error of 4.20 mm, which is 85% lower than the error of the original annotations in Assembly101 [79].
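The point-level triangulation and reprojection described above can be illustrated with the standard linear (DLT) formulation. This is a generic sketch under the assumption of calibrated cameras given as 3x4 projection matrices; it is not the exact pipeline of any dataset cited here, and the function names are chosen for this example.

```python
import numpy as np


def project(P, X):
    """Project a 3D point X (3,) with a 3x4 camera matrix P to pixel coordinates (2,)."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def triangulate(projections, points_2d):
    """Linear (DLT) triangulation of one 3D point from two or more views.

    Each view contributes two linear equations in the homogeneous 3D point.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The least-squares solution is the right singular vector associated
    # with the smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

Once a joint has been triangulated from a few annotated views, calling `project` with the matrix of any other calibrated camera yields a 2D label for that view, which is how one annotation can be propagated to many images.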

Model fitting (RGB).
Model fitting is also used in RGB-based pose annotation. FreiHAND [108,110] fits a 3D hand template (MANO [76]) to multi-view hand images with sparse 2D keypoint annotation. The dataset increases the variation of training images by randomly synthesizing the background and using captured real hands as the foreground. YouTube3DHands [42] fits the MANO model to estimated 2D hand poses in YouTube videos. HO-3D [25], DexYCB [9], and H2O [43] jointly annotate 3D hand and object poses to facilitate a better understanding of hand-object interaction. Using estimated or manually annotated 2D keypoints, these datasets fit the MANO model and 3D object models to the hand images with objects.
While most methods capture hands from static third-person cameras, H2O and AssemblyHands install first-person cameras that are synchronized with static third-person cameras (see Fig. 9). With camera calibration and head-mounted camera tracking, such camera systems can offer 3D hand pose annotations for first-person images by projecting annotated keypoints from third-person cameras onto first-person image planes. This reduces the cost of annotating first-person images, which is considered expensive because the image distribution changes drastically over time and the hands are sometimes out of view.
Fig. 10 Self-supervised pretraining of 3D hand pose estimation [108]. The pretraining phase (step 1) aims to construct an improved encoder network by using many unlabeled data before supervised learning (step 2). The work uses MoCo [32] as a method of self-supervised learning.
These computational methods can generate labels with little human effort, although the camera system itself is costly. However, assessing the quality of the labels is still difficult. In fact, the annotation quality depends on the number of cameras and their arrangement, the accuracy of hand detection and the estimation of 2D hand poses, and the performance of triangulation and fitting algorithms.
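The transfer of third-person annotations to first-person images mentioned above amounts to a rigid transform into the tracked head-mounted camera frame followed by a pinhole projection. The sketch below assumes that the per-frame extrinsics `R`, `t` (world to camera) and intrinsics `K` of the first-person camera are known; these names and the visibility check are illustrative, not the implementation of H2O or AssemblyHands.

```python
import numpy as np


def world_to_camera(points_world, R, t):
    """Map (J, 3) world points into a camera frame: X_cam = R @ X_world + t."""
    return points_world @ R.T + t


def project_pinhole(points_cam, K):
    """Project (J, 3) camera-space points to (J, 2) pixels.

    Also returns a boolean mask of points lying in front of the camera.
    """
    z = points_cam[:, 2]
    uv_h = points_cam @ K.T
    uv = uv_h[:, :2] / z[:, None]
    return uv, z > 0


def visible_in_image(uv, in_front, width, height):
    """Mark joints that are in front of the camera and inside the image bounds."""
    inside = (
        (uv[:, 0] >= 0) & (uv[:, 0] < width)
        & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    return inside & in_front
```

The visibility mask reflects the remark that hands are sometimes out of view in first-person images: joints that project outside the frame still have valid 3D labels but no usable 2D location.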

Learning with Limited Labels
As explained in Section 4, existing annotation methods have certain pros and cons. Since perfect annotation in terms of amount and quality cannot be assumed, training 3D hand pose estimators with limited annotated data is another important topic. Accordingly, we introduce learning methods using unlabeled data in this section, namely self-supervised pretraining, semi-supervised learning, and domain adaptation.

Self-supervised pretraining and learning
Self-supervised pretraining aims to utilize massive unlabeled hand images and build an improved encoder network before supervised learning with labeled images. As shown in Fig. 10, recent works [82,108] first pretrain an encoder network that extracts image features by using contrastive learning (e.g., MoCo [32] and SimCLR [12]) and then fine-tune the whole network in a supervised manner. The core idea of contrastive learning is to pull a pair of similar instances closer together in an embedding space while pushing unrelated instances apart. This approach focuses on how to define the similarity of hand images and how to design embedding techniques. Spurr et al. proposed to geometrically align two features generated from differently augmented instances [82]. Zimmermann et al. found that multi-view images representing the same hand pose can be effective pair supervision [108].
Other works utilize the scheme of self-supervised learning that solves an auxiliary task instead of the target task of hand pose estimation. Given the prediction on an unlabeled depth image, the works in [60,95] render a synthetic depth image and penalize the mismatch between the input image and the one generated from the prediction. This auxiliary loss by image synthesis is informative even when annotations are scarce.
Fig. 11 Semi-supervised learning of 3D hand pose estimation [47]. The model is trained jointly on annotated data and unlabeled data with pseudo-labels.
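The core idea of contrastive learning described in this section can be written as an InfoNCE-style loss over a batch of embedding pairs. The sketch below is a simplified, one-directional variant in NumPy, assuming that row i of `z_a` and row i of `z_b` come from a positive pair (e.g., two augmentations or two views of the same hand pose) and that the other rows in the batch serve as negatives; the temperature value is an assumption, and MoCo and SimCLR differ from this sketch in how negatives are collected.

```python
import numpy as np


def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss for N positive pairs (z_a[i], z_b[i]).

    z_a, z_b: (N, D) embeddings. All z_b[j] with j != i act as negatives
    for z_a[i].
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal of the similarity matrix.
    return float(-np.mean(np.diag(log_prob)))
```

The loss is small when each embedding is most similar to its own partner and grows when the partner is no more similar than the negatives, which is the behavior the pretraining stage exploits before supervised fine-tuning.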

Semi-supervised learning
As shown in Fig. 11, semi-supervised learning is used to learn from small labeled data and large unlabeled data simultaneously. Liu et al. proposed a pseudo-labeling method that learns from unlabeled instances with pseudo-ground-truth derived from the model's prediction [47]. This pseudo-label training is applied only when its prediction satisfies spatial and temporal constraints. The spatial constraints check the correspondence of a 2D hand pose and the 2D pose projected from 3D hand pose prediction. In addition, they include a constraint based on bio-mechanical feasibility, such as bone lengths and joint angles. The temporal constraints indicate the smoothness of hand pose and mesh predictions over time.
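As a rough illustration of constraint-based pseudo-label selection, the sketch below accepts a prediction only if the projected 3D pose agrees with the predicted 2D pose and all bone lengths fall in a plausible range. It is not the implementation of [47]: the thresholds, the bone list format, and the function names are hypothetical, and the joint-angle and temporal constraints are omitted for brevity.

```python
import numpy as np


def reprojection_error(joints_3d, joints_2d, K):
    """Mean pixel distance between 2D joints and the projection of 3D joints."""
    uv_h = joints_3d @ K.T
    uv = uv_h[:, :2] / uv_h[:, 2:3]
    return float(np.linalg.norm(uv - joints_2d, axis=1).mean())


def bone_lengths(joints_3d, bones):
    """Lengths of bones given as (parent, child) joint index pairs."""
    parents, children = zip(*bones)
    return np.linalg.norm(
        joints_3d[list(children)] - joints_3d[list(parents)], axis=1
    )


def accept_pseudo_label(joints_3d, joints_2d, K, bones, length_range,
                        max_pixel_error=5.0):
    """Keep a prediction as a pseudo-label only if both checks pass.

    length_range = (low, high) bounds on bone length; max_pixel_error bounds
    the 2D/3D disagreement. Both thresholds are hypothetical values.
    """
    low, high = length_range
    lengths = bone_lengths(joints_3d, bones)
    spatial_ok = reprojection_error(joints_3d, joints_2d, K) <= max_pixel_error
    feasible = bool(np.all((lengths >= low) & (lengths <= high)))
    return bool(spatial_ok and feasible)
```

In a training loop, only the unlabeled samples for which `accept_pseudo_label` returns true would contribute to the pseudo-label loss, which limits the influence of implausible predictions.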
Yang et al. proposed the combination of pseudo-labeling and consistency training [102]. In pseudo-labeling, the generated pseudo-labels are corrected by fitting the hand template model. In addition, the work enforces consistency losses between the predictions of differently augmented instances and between the modalities of 2D hand poses and hand masks.
Spurr et al. applied adversarial training to a sequence of predicted hand poses [84]. The encoder network is expected to be improved by fooling a discriminator that distinguishes between plausible and invalid hand poses.

Domain adaptation
Domain adaptation aims to improve model performance on target data by learning from labeled source data and target data with limited labels. This study has addressed two types of underlying domain gaps: between different datasets and between different modalities.
The former problem between different datasets is a common domain adaptation problem where the source and target data are sampled from two datasets with different image statistics, e.g., sim-to-real adaptation [41,90] (see Fig. 12). The model has access to readily available synthetic images with labels and target real images without labels. The latter problem between different modalities is characterized as modality transfer where the source and target data represent the same scene, but their modalities are different, e.g., depth vs. RGB (see Fig. 13). This aims to utilize information-rich source data (e.g., depth images, which contain 3D structural information) for inference on easily available target data (e.g., RGB images).
To reduce the gap between the two datasets, two major approaches have been proposed: generative methods and adversarial methods. In generative methods, Qi et al. proposed an image translation method to alter the synthetic textures to realistic ones and train a model on generated real-like synthetic data [67].
Adversarial methods enforce matching two domains' features so that the feature extractor can encode features even from the target domain. However, in addition to the domain gap in an input space (e.g., the difference in backgrounds), the gap in a label space also exists in this task, which is not assumed in typical adversarial methods [20,93]. Zhang et al. developed a feature matching method based on Wasserstein distance and proposed adaptive weighting to enable matching only for features related to hand characteristics, except for label information [106]. Jiang et al. utilized an adversarial regressor and optimized the domain disparity by a minimax game [41]. Such minimax of disparity is effective in domain adaptation of regression tasks, including hand pose estimation.
As for the modality transfer problem, Yuan et al. and Rad et al. attempted to use depth images as the auxiliary information during training and test the model on RGB images [70,104]. They observed that learned features from depth images could support RGB-based hand pose estimation. Park et al. transferred the knowledge from depth images to infrared (IR) images that have less motion blur [66]. Their training is facilitated by matching two features from paired images, e.g., (RGB, depth) and (depth, IR). Baek et al. newly defined the domain of hand-only images where a hand-held object is removed. The work translates hand-object images to hand-only images by using GAN and mesh renderer [1]. Given a hand-object image with an unknown object, this method can generate hand-only images, from which hand pose estimation is more tractable.
Fig. 12 Poor generalization to an unknown domain [41]. The models trained on synthetic images (source) exhibit a limited capacity for inferring poses on real images (target).

Future Directions

Flexible camera systems
We believe that hand image capture will feature more flexible camera systems, such as using first-person cameras. To reduce the occlusion effect without the need for hand markers, recently published hand datasets have been acquired by multi-camera setups, e.g., DexYCB [9], InterHand2.6M [55], and FreiHAND [110]. These setups are static and not suitable for capturing dynamic user behavior. To address this, a first-person camera attached to the user's head or body is useful because it mostly captures close-up hands even when the user moves around. However, as shown in Table 1, existing first-person benchmarks have a very limited variety due to heavy occlusion, motion blur, and a narrow field-of-view.
One promising direction is a joint camera setup with first-person and third-person cameras, such as H2O [43] and AssemblyHands [61]. This results in flexibly capturing the user's hands from the first-person camera while taking the benefits of multiple third-person cameras (e.g., mitigating the occlusion effect). However, the first-person camera wearer does not always have to be alone. Image capture with multiple first-person camera wearers in a static camera setup will advance the analysis of multi-person cooperation and interaction, e.g., game playing and construction with multiple people.

Various types of activities
We believe that increasing the type of activities is an important direction for generalizing models to various situations with hand-object interaction. A major limitation of existing hand datasets is the narrow variation of the tasks users perform and the objects they grasp. To avoid object occlusion, some works did not capture hand-object interaction [55,105,109]. Others [9,25,31] used pre-registered 3D object models (e.g., YCB [7]) to simplify in-hand object pose estimation. User action is also very simple in these benchmarks, such as pick and place.
From an affordance perspective [30], diversifying the object category will result in increasing hand pose variation. Potential future works will capture goal-oriented and procedural activities that naturally occur in our daily life [16,24,79], such as cooking, art and craft, and assembly.
To enable this, we need to develop portable camera systems and robust annotation methods for complex backgrounds and unknown objects. In addition, the hand poses that occur are constrained by the context of the activity. Thus, pose estimators conditioned by actions, objects, or textual descriptions of the scene will improve estimation in various activities.

Towards minimal human effort
Sections 4 and 5 separately explain efficient annotation and learning. To minimize the effort of human intervention, utilizing findings from both annotation and learning perspectives is one of the promising directions. Feng et al. exploited active learning that optimizes which unlabeled instance should be annotated and semi-supervised learning that jointly utilizes labeled data and large unlabeled data [19]. However, this method is constrained to triangulation-based 3D pose estimation. As we mentioned in Section 4.4, another major computational annotation is model fitting; thus, we still need to consider such a collaborative approach in the annotation based on model fitting.
Zimmermann et al. also proposed a framework of human-in-the-loop annotation that inspects the annotation quality manually while updating annotation networks on the inspected annotations [109]. However, this human check will be a bottleneck in large dataset construction. The evaluation of annotation quality on the fly is a necessary technique to scale up the combination of annotation and learning.

Generalization and adaptation
Increasing the generalization ability across different datasets or adapting models to a specific domain is a remaining issue. The bias of existing training datasets hinders the estimators from inferring test images captured under very different imaging conditions. In fact, as reported in [27,110], models trained on existing hand pose datasets poorly generalize to other datasets. For real-world applications (e.g., AR), it is crucial to transfer models from indoor hand datasets to outdoor videos because common multi-camera setups are not available outdoors [62]. Thus, aggregating multiple annotated yet biased datasets for generalization and robustly adapting to very different environments are important future tasks.

Summary
We presented a survey of 3D hand pose estimation from the standpoint of efficient annotation and learning. We provided a comprehensive overview of this task and modeling, and open challenges during dataset construction. We investigated annotation methods categorized as manual, synthetic-model-based, hand-marker-based, and computational approaches, and examined their respective strengths and weaknesses. In addition, we studied learning methods that can be applied even when annotations are scarce, namely self-supervised pretraining, semi-supervised learning, and domain adaptation. Finally, we discussed potential future advancements in 3D hand pose estimation, including next-generation camera setups, increased object and action variation, jointly optimized annotation and learning techniques, and generalization and adaptation.

Fig. 3 Difficulty of hand pose annotation in a single RGB image [81]. Occlusion of hand joints is caused by (a) articulation, (b) viewpoint bias, and (c) grasping objects.

Fig. 6 Calculation of joint positions from tracked markers [105]. S_i denotes the position of the markers, and W, M_i, P_i, D_i, and T_i are the positions of hand joints listed from the wrist to the fingertips.

Fig. 13 Example of modality transfer. During training, RGB and depth images are accessible and RGB images are given in the test phase. The training aims to utilize the support of depth information to improve RGB-based hand pose estimation.

Table 1
Taxonomy of methods for annotating 3D hand poses.We categorize the annotation methods as manual, synthetic-model-based, hand-marker-based, and computational annotation.
of objects, including hand-held objects (i.e., foregrounds) and backgrounds, is potentially diverse.

Table 2
Pros and cons of each annotation approach.
joint position.To reduce such unavoidable errors, subsequent annotation methods based on multicamera setups provide further accurate annotations (see Section 4.4).