Machine learning and computer vision are increasingly integrated into healthcare. This is apparent in a wide range of tasks, such as tumor segmentation [1], technical skill assessment [2,3,4,5,6], and tool detection and tracking [7,8,9,10]. Here we study the problem of articulated hand pose tracking in the surgical domain. Tracking hand poses can facilitate other useful tasks, such as technical skill assessment, temporal action recognition, and training surgical residents. Pose tracking in the computer vision community is primarily centered around human poses [11,12,13,14,15,16,17,18,19], while medical works focus on detecting and tracking surgical instruments [7,8,9,10]. Tracking surgical instruments is useful, but these instruments are specific to the surgical procedures seen during training. Instead, we shift the emphasis away from surgical instruments toward articulated hand tracking, which is applicable to a broader range of surgical tasks. Articulated hand pose tracking can highlight important properties such as grip, motion, and tension that human experts often attend to when evaluating videos.

A challenge in pose tracking is the temporal consistency of predictions between frames, the lack of which leads to flickering and improbable changes in estimated poses. Existing works in articulated pose tracking [11, 14, 17,18,19] use frame-wise independent predictions, along with post-processing when tracking [12, 13, 15, 16], to gather temporal context. However, they do not integrate past inferences when localizing joints. We address this by proposing CondPose, a new model that performs predictions conditioned on the pose estimates from prior frames. In Fig. 1, we show a comparison of both approaches: the baseline using frame-wise independent predictions and our model using conditional predictions. The initial estimate may fluctuate due to varying factors, such as lighting, hand orientation, or motion blur. However, we find that using prior predictions as guidance improves localization accuracy. The internal representation of an object’s state (position, appearance, and classification) is a function of its current state and previous states. By learning this Markovian prior for the prediction of hand joints, we can improve both pose estimation and, consequently, tracking accuracy.

Fig. 1

On the left, a method only performing frame-wise independent predictions may miss out on properly localizing joints, while on the right, temporally passing past predictions from previous frames improves the network’s localization

There is a lack of data and benchmarks for articulated hand pose tracking. To address this, we collect a novel dataset featuring intra-operative videos of real surgeries, Surgical Hands. We annotate the articulated hand poses of surgeons which subsumes both surgical instrument and non-instrument actions, e.g., suturing, knot-tying, and gesturing. We are, to the best of our knowledge, the first to introduce a labeled dataset for both detection and tracking of multiple articulated hand poses. We benchmark our dataset against existing tracking baselines and demonstrate the superiority of our proposed approach on both hand pose estimation and tracking.

Our contributions are as follows:

  • We introduce CondPose, a novel deep network that takes advantage of confident prior predictions to improve localization accuracy and tracking consistency.

  • We present Surgical Hands, a new video dataset for multi-instance articulated hand pose estimation and tracking in the surgical domain.

  • We set new state-of-the-art benchmark performance on Surgical Hands.

Related works

Articulated pose estimation and tracking

Surgical instruments

Data-driven methods in the medical video domain primarily involve robot-assisted surgery (RAS) videos. Works in this space [3,4,5] traditionally use kinematic data directly, requiring an external apparatus to capture these measurements. However, full kinematic information is typically available only for robot-controlled tools, and rarely for hand-held instruments. Adding any external apparatus to capture kinematic data can negatively impact the costs, flexibility, and performance of certain operations. In contrast, pure computer vision-based approaches extract information directly from video data to perform object detection. Many vision works use a region proposal network to perform localization [7, 20, 21], segmentation [9, 22], and articulated pose estimation [8, 23] from images.

To incorporate tracking, existing works may use a similarity function based on weighted mutual information [24] or Bayesian filtering as part of a minimization problem [25]. Nwoye et al. [10] are the first to measure the Multiple Object Tracking Accuracy (MOTA) [26] for surgical instruments in this setting, using a weakly-supervised approach with coarse binary labels indicating the presence or absence of seven surgical instruments. However, their evaluation contains at most one unique type of tool at each frame; hence, the task can be narrowed down to an object detection problem. Unlike their work, we track multiple instances of the same object in each frame. We also use MOTA as part of our benchmark when tracking hands in our videos.

Human pose

Pose estimation and tracking is commonly applied to images and videos of people, with methods grouped into top-down [12,13,14,15,16] and bottom-up [17,18,19] strategies. Top-down methods detect all persons in an image, then regress each human pose independently using a pose estimation network. Bottom-up methods detect all joints in an image, and use bipartite matching and graph minimization techniques to assign joints to each person. As top-down approaches typically perform best in practice, we follow this paradigm. For tracking, [12] uses greedy matching from IoU (intersection-over-union) overlap and optical flow to propagate bounding boxes between frames, [13] uses deformable convolutions to warp predictions between frames, and [15] introduces a Graph Convolutional Network (GCN) [27] to match learned embeddings between human poses. A GCN is a neural network whose input consists of a set of nodes and edges, performing convolution operations on the relations of nodes. The inherent structure of this graph can improve the quality of learned features and abstract away from the limitations of a 2D space. These approaches spatially shift pose predictions, which cannot overcome certain factors (e.g., missed detections). In contrast, we address this problem at the detection step by integrating past pose observation(s) into each new predicted output.

Fig. 2

The baseline generates a heatmap, \(\hat{\mathcal {H}}'_t\), for each detection using a pose estimation network. In our model, we provide additional information by incorporating a heatmap prior from \(t - \delta \). Concatenating the image features at t with \(\hat{\mathcal {H}}_{t-\delta }\), we pass this through our attention mechanism to produce a weighted heatmap prior, \(\hat{\mathcal {H}}'_{t-\delta }\). Both \(\hat{\mathcal {H}}'_t\) and \(\hat{\mathcal {H}}'_{t-\delta }\) are concatenated and passed through the fusing module, using context from both heatmaps to produce the final articulated hand pose. (The initial and final heatmaps represent real outputs from the network, while the heatmap prior (during training) shows ground truth at \(t - \delta \))

Hand pose

Current works on 2D hand pose estimation [28,29,30] are analogous to human pose estimation. Zhang et al. [31] perform pose tracking, using a disparity map from stereo camera inputs to estimate a 3D hand pose. However, their data consists of only a single subject’s hand and at most one detection per frame. There are many image datasets [28, 30,31,32] for hand pose estimation, built from a combination of manual, synthetic, and predicted annotations. However, none satisfies the conditions of multiple object instances and tracking from video, let alone in a surgical setting. Therefore, we introduce the Surgical Hands dataset for multi-instance articulated hand pose tracking. Our dataset includes varying lighting conditions, fast movement, and diversity in scene appearances. Distinctively, we also include gloved hands, which appear in contrasting colors such as latex and green.


We propose CondPose to perform articulated pose detection and tracking by incorporating previous observations as prior guidance. We show our model in Fig. 2. While the baseline produces a heatmap from each hand using a pose estimation network, we leverage past predictions to produce conditioned hand pose outputs, improving detection performance during inference. While we design CondPose with video data in mind, we begin by pretraining on image data, then finetune on our video dataset, Surgical Hands, and lastly compare different tracking methods.

Hand pose estimation in images

We first pretrain on image data, defining the input and output for the pose estimation network, P, as \(\hat{\mathcal {H}} = P(\mathcal {I})\). The input is an image crop \(\mathcal {I}\), \(\mathcal {I} \in \mathbb {R}^{H \times W \times 3}\), and the output is a predicted heatmap \(\hat{\mathcal {H}}\), \(\hat{\mathcal {H}} \in \mathbb {R}^{H' \times W' \times J}\). Here \(H, W\) represent the input image height and width and \(H', W'\) are the output heatmap height and width. J represents the number of predicted joints of each hand. Each image crop is scaled to 2.2 times the total area of the hand bounding box. We train using the mean squared error (MSE) between the ground truth and predicted heatmaps as \(\mathcal {L} = \Vert (\mathcal {H} - \hat{\mathcal {H}}) \odot \mathcal {M} \Vert ^2\). The ground truth heatmaps, \(\mathcal {H}\), are generated from 2D Gaussians centered on each annotated keypoint. The mask \(\mathcal {M}\) is included to mask out un-annotated joints. The output joint locations are the positions of the maximum value in each of the J channels (the third dimension) of \(\hat{\mathcal {H}}\). After pretraining, we finetune our model on videos to learn conditional hand pose predictions.
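The heatmap target, masked loss, and joint decoding described above can be illustrated with a minimal sketch. It is not our training code: the function names, the Gaussian width `sigma`, and the joint-first layout `[joint][row][column]` (rather than \(H' \times W' \times J\)) are assumptions made only for readability.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=1.0):
    """Return a height x width grid holding a 2D Gaussian centered on (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def masked_squared_error(target, pred, mask):
    """Sum of squared differences over joints whose mask entry is 1 (annotated)."""
    total = 0.0
    for j, keep in enumerate(mask):
        if not keep:
            continue  # un-annotated joints contribute nothing to the loss
        for row_t, row_p in zip(target[j], pred[j]):
            for t, p in zip(row_t, row_p):
                total += (t - p) ** 2
    return total


def decode_joints(heatmaps):
    """Return the (x, y) location of the maximum value in each joint's heatmap."""
    joints = []
    for hm in heatmaps:
        cells = ((v, x, y) for y, row in enumerate(hm) for x, v in enumerate(row))
        _, x, y = max(cells, key=lambda cell: cell[0])
        joints.append((x, y))
    return joints
```

For example, decoding heatmaps generated by `gaussian_heatmap` recovers the keypoints they were centered on, and a joint with mask value 0 leaves the loss unchanged regardless of the prediction.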

Hand pose estimation in videos

While image data cannot be used to learn our conditional hand pose predictions, we can initialize weights to speed up our training process and improve generalizability. We finetune CondPose on Surgical Hands, shown in the top portion of Fig. 2. To incorporate a prior branch, we introduce a heatmap prior, \(\hat{\mathcal {H}}_{t-\delta }\), a pose estimate of the same object from \(t-\delta \). Our model performs conditional predictions, defined as

$$\begin{aligned} \hat{\mathcal {H}}_t = M_\mathrm{fus}(P (\mathcal {I}_t); M_\mathrm{att}(v_t; \hat{\mathcal {H}}_{t-\delta })) . \end{aligned}$$

In contrast to our previous definition of P, \(\hat{\mathcal {H}}_t\) is now conditioned on predictions at a previous time step \(t - \delta \). Our model is further composed of two branches: the attention mechanism, \(M_\mathrm{att}\), and the fusing module, \(M_\mathrm{fus}\). \(M_\mathrm{att}\) contextualizes the prior heatmap prediction, \(\hat{\mathcal {H}}_{t-\delta }\), with image features, \(v_t\) (\(\mathrm{conv\_1}\) in our experiments), at time t. This branch relates the visual representation and the localized heatmap prior, ideally learning to weight each joint prior accordingly. \(M_\mathrm{fus}\) produces a merged final heatmap from the initial prediction, \(\hat{\mathcal {H}}'_t\), and the weighted heatmap prior, \(\hat{\mathcal {H}}'_{t-\delta }\). \(M_\mathrm{att}\) and \(M_\mathrm{fus}\) are each composed of two convolutional layers, followed by a transposed convolution, with ReLU nonlinearities in between.

During training, the prior is selected from frame \(t - \delta \). If the object does not exist at that frame, we use earlier frames up until the first occurrence. If a corresponding object does not exist in any previous frame, then the prior, \(\hat{\mathcal {H}}_{t-\delta }\), is set to an all-zero heatmap. This is expected behavior during evaluation, because priors do not yet exist at frame one. Also during evaluation, unlike training, the prior associated with the current detection is unknown. Given n priors from time \(t-1\), \(\{ \hat{\mathcal {H}}_{t-1}^{1}, \hat{\mathcal {H}}_{t-1}^{2}, \ldots , \hat{\mathcal {H}}_{t-1}^{n}\}\), and k detections at time t, \(\{ \mathcal {I}_{t}^{1}, \mathcal {I}_{t}^{2}, \ldots , \mathcal {I}_{t}^{k}\}\), we pass all pairs through the network to generate candidates. The heatmap with the highest average confidence score is selected as the output for that detection.
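The evaluation-time selection over candidate priors can be sketched as follows. This is an illustration under stated assumptions: `cond_pose` is a hypothetical callable standing in for the network, heatmaps are stored as `[joint][row][column]`, and the confidence of a candidate is taken to be the mean of its per-joint peak values, which is one plausible reading of "average confidence score".

```python
def mean_confidence(heatmaps):
    """Average of the per-joint peak values (assumed definition of confidence)."""
    peaks = [max(max(row) for row in hm) for hm in heatmaps]
    return sum(peaks) / len(peaks)


def select_conditioned_output(crop, priors, cond_pose, zero_prior):
    """Pair one detection with every prior and keep the most confident output.

    cond_pose(crop, prior) is a hypothetical stand-in for the conditional
    network; zero_prior is the all-zero heatmap used when no prior exists.
    """
    candidates = priors if priors else [zero_prior]  # e.g., the first frame
    outputs = [cond_pose(crop, prior) for prior in candidates]
    return max(outputs, key=mean_confidence)
```

Calling `select_conditioned_output` once per detection yields k outputs from the n x k candidate pairs; with an empty list of priors it falls back to the all-zero prior.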

Fig. 3

We show samples from our annotations. Each hand is labeled with a bounding box, handedness, tracking id, and visibility of joints

Matching strategies for tracking

Following the detect-then-track paradigm, we require a matching strategy to perform tracking. Given n hands at time \(t-1\) and k hands at time t, we use a similarity function to derive similarity measures between each pair at \(t-1\) and t. Common methods are intersection-over-union (IoU) of bounding boxes, average L2-distance of the predicted joint locations, or L2-distance between the graph pose embeddings. Similar to Ning et al. [15], we train a GCN to output the embedding of each input hand pose, \(\mathcal {X}\), defined simply as \(\hat{p} = GCN(\mathcal {X})\). Here \(\mathcal {X} \in \mathbb {R}^{J \times C}\), where J is the number of joints and C is the number of channels. For training, we use the contrastive loss [33], \(\mathcal {L} = \frac{1}{2} \left( y \, d^2 + (1 - y) \max \left( 0, m - d \right) ^2 \right) \). The contrastive loss pulls embeddings of the same hand together and pushes embeddings of different hands at least a margin apart. For a pair of embeddings \(\hat{p}_v^1\) and \(\hat{p}_v^2\), the variable d represents the L2-distance between the two, \(d = \Vert \hat{p}_v^1 - \hat{p}_v^2\Vert _2\). y is a binary label indicating the same hand, 1, or different hands, 0. m is the margin, a tunable hyperparameter. For each item in our minibatch, positive pairs are selected between adjacent frames with probability \(p=0.5\) and negative pairs are selected from the same video with \(p=0.4\) or from a different video with \(p=0.1\). We evaluate our trained GCN models using the classification accuracy between pairs of selected hands, achieving classification accuracies of \(>97\%\).
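As a small worked illustration of the embedding loss, the sketch below evaluates a contrastive loss on a single pair of embeddings. It assumes the common margin form, \(\frac{1}{2}(y\,d^2 + (1-y)\max(0, m-d)^2)\) with d the Euclidean distance, which may differ in detail from the exact variant used in training; the function names and the default margin are hypothetical.

```python
import math


def l2_distance(p1, p2):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))


def contrastive_loss(p1, p2, y, margin=1.0):
    """Contrastive loss for one pair; y = 1 for the same hand, 0 otherwise."""
    d = l2_distance(p1, p2)
    pull = y * d ** 2                            # same hand: penalize distance
    push = (1 - y) * max(0.0, margin - d) ** 2   # different: penalize d < margin
    return 0.5 * (pull + push)
```

A positive pair is penalized in proportion to its squared distance, while a negative pair contributes nothing once its embeddings are at least `margin` apart.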


We lack data for training and benchmarking models on multi-instance hand tracking. Therefore, we introduce Surgical Hands, a novel video dataset for multi-instance articulated hand pose estimation and tracking in the surgical domain, the first of its kind. From publicly available data, we collect \(28\) videos with a view of the hands of surgical team members during the operation. From those videos, we extract \(76\) clips sampled at 8 frames per second and collect bounding box, class label, tracking id, and pose annotations using Amazon Mechanical Turk (AMT) and a modified version of Visipedia Annotation Tools. We show samples of our annotations in Fig. 3. Each hand is labeled with the handedness (left/right), 21 joints, and a property for each joint: visible, occluded, or not available. Visible implies that the joint is visibly on screen, occluded means the joint is obstructed but its position can be estimated, and not available means the joint position cannot be inferred or it is off-screen. From our collected data, we have a total of \(2,838\) annotated frames and \(8,178\) unique hand annotations from 21 unique annotators. Each annotated frame contains a mean of 2.88 hands, a median of 3 hands, and a maximum of 7 hands.

Table 1 Mean Average Precision (mAP)
Fig. 4

We show qualitative samples of frames from the best performing (top row) and lower performing (bottom row) videos. (Best viewed in color)

Experiments and evaluation

Implementation details

We adopt a ResNet-152 pose estimation model [12] and first train it on hand pose image data, CMU Manual Hands and Synthetic Hands [28]. We use a batch size of 16, training for 30 epochs, with an Adam optimizer and a learning rate of \(10^{-3}\). When finetuning on Surgical Hands, we use leave-one-out cross-validation and split our data into \(28\) different folds. Clips belonging to the same video are in the same validation fold, and the reported metrics are averaged across all folds. We employ a variant of curriculum learning that gradually transitions from ground truth priors to predicted priors. A predicted prior at \(t-\delta \) is sampled with a probability of \(p = 0.10 \times \mathrm{epoch}\), until only predictions are used for training at epoch 10 and onward. We empirically select \(\delta =3\) during training. For all training, we apply random rotations and horizontal flipping as data augmentation. When training the GCN for tracking, we use a batch size of 32 and train for 60 epochs with an initial learning rate of \(10^{-3}\). We normalize \(\mathcal {X}\) to 0-1, relative to keypoint positions along the bounding box. The input dimension for each input is \(J \times C\), where J represents the number of joints and C is the number of channels. We use \(C=2\) for x-y coordinates and \(C=3\) to include annotation state (0 = unannotated, 1 = annotated, or 0-1 for predicted keypoints). We adopt a two-layer Spatio-Temporal GCN [15, 34] to output a 128-dimensional embedding of each pose.
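The curriculum schedule for priors and the keypoint normalization can be sketched as below. This is a simplified illustration, not our training loop: it assumes epochs are counted from 0, and the function names and the `rng` argument are hypothetical. Keypoints lying outside the bounding box are not clipped in this sketch.

```python
import random


def predicted_prior_probability(epoch):
    """Probability of using a predicted prior; reaches 1.0 at epoch 10."""
    return min(1.0, 0.10 * epoch)


def choose_prior(epoch, gt_prior, predicted_prior, rng=random):
    """Sample between the ground truth and predicted prior for one example."""
    if rng.random() < predicted_prior_probability(epoch):
        return predicted_prior
    return gt_prior


def normalize_keypoints(keypoints, box):
    """Scale (x, y) keypoints to 0-1 relative to box = (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    width, height = x_max - x_min, y_max - y_min
    return [((x - x_min) / width, (y - y_min) / height) for x, y in keypoints]
```

Under this schedule the model sees only ground truth priors at the start of finetuning and only its own predictions from epoch 10 onward, which matches the conditions it faces at inference.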

Table 2 We optimize for the Multiple Object Tracking Accuracy (MOTA); each performance metric is averaged across all validation folds

Detection performance

We evaluate detection performance on our Surgical Hands dataset using mean Average Precision (mAP), the metric of choice in human pose evaluation. mAP is computed using the Probability of Correct Keypoints (PCK), which measures the probability of correctly localizing keypoints within a normalized threshold distance, \(\sigma \). This threshold distance, \(\sigma = 0.2\), is empirically chosen to be roughly the ratio between the length of a thumb joint and the enclosing bounding box. Pose predictions are matched to ground truth poses based on the highest PCK, and unassigned predictions are counted as false positives. AP for each joint is computed and mAP is reported across the entire dataset. In Table 1 we report the mAP at the highest MOTA score (defined in the next section) for each model. With our recursive heatmap strategy, we obtain higher average precision across the different joints in the hand. In Fig. 4 we show qualitative examples of our hand pose estimation on various frames from our Surgical Hands dataset. The top row is sampled from the best performing clips, while the bottom row is from the worst performing clips. We see that the model suffers most in cases of heavy occlusion, where the camera view excludes the majority of the hand. Ambiguity in the position of the hand compounds the localization errors, e.g., a top-down view with most fingers occluded. The best performing cases are those with balanced lighting and an unambiguous view of the first few digits.
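To make the PCK criterion concrete, the sketch below scores one predicted pose against its ground truth. It is illustrative only: normalizing by the longer side of the bounding box is an assumption (the exact normalization is not specified here), and the function name and the use of `None` for un-annotated joints are hypothetical.

```python
import math


def pck(pred, gt, box, sigma=0.2):
    """Fraction of annotated joints predicted within sigma * box size.

    pred: list of (x, y); gt: list of (x, y) or None for un-annotated joints;
    box: (x_min, y_min, x_max, y_max) enclosing the ground truth hand.
    """
    x_min, y_min, x_max, y_max = box
    scale = max(x_max - x_min, y_max - y_min)  # assumed normalization length
    hits, total = 0, 0
    for p, g in zip(pred, gt):
        if g is None:
            continue  # un-annotated joints are not scored
        total += 1
        if math.hypot(p[0] - g[0], p[1] - g[1]) <= sigma * scale:
            hits += 1
    return hits / total if total else 0.0
```

With a 100-pixel-wide box and \(\sigma = 0.2\), a joint counts as correct when it falls within 20 pixels of its annotation.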

Fig. 5

We show a qualitative comparison between the baseline model and our method. We note a higher recall and consistency between frames, as shown for the hand to the left. Even when the pinky finger is not visible, the past predictions reinforce those joint locations

Table 3 MOTA performance between matching strategies, averaged across all folds. Each row is optimized for highest MOTA performance. Matching strategies share the same base model, so it is possible for them to share the same mAP score

Tracking performance

To measure tracking performance, we use Multiple Object Tracking Accuracy (MOTA) which also takes into account the consistency of localized keypoints between frames. MOTA [26] is defined as:

$$\begin{aligned} \mathrm{MOTA} = 1 - \frac{\sum _t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right) }{\sum _t G_t} \end{aligned}$$

This encapsulates errors that may occur during multiple object tracking: false negatives (FN), false positives (FP), and identity switches (IDSW). FN are joints for which no hypothesis/prediction was given, FP are hypotheses for which no real joint exists, and IDSW are occurrences where the tracking ids of two joints are swapped. G represents the total number of ground truth joints. Expressed as a percentage, the MOTA score takes values in \((- \infty , 100]\).
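The MOTA definition above translates directly into a few lines of code. The sketch assumes the per-frame error counts have already been obtained by matching predictions to ground truth; the function name and the tuple layout are hypothetical.

```python
def mota(per_frame):
    """Compute MOTA from per-frame (FN, FP, IDSW, G) tuples.

    Returns a fraction; multiply by 100 for a percentage.
    """
    frames = list(per_frame)
    errors = sum(fn + fp + idsw for fn, fp, idsw, _ in frames)
    total_gt = sum(g for _, _, _, g in frames)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground truth joints")
    return 1.0 - errors / total_gt
```

Because false positives are unbounded, the score becomes negative whenever the total number of errors exceeds the number of ground truth joints.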

We perform tracking using three matching methods: IoU, L2-distance, and GCN. Intersection-over-union (IoU) measures the overlap of two bounding boxes as the ratio of their area of intersection to their area of union, computed between subsequent frames in our case. L2-distance measures the average L2 distance of regressed keypoints between frames. GCN measures the embedding similarity between the encoded keypoints to determine matches. We show quantitative results from our experiments in Table 3 and the per-joint performance in Table 2. Each row is maximized for the highest MOTA score across all hyperparameters, shown along with its corresponding mAP. Our method has a higher MOTA score across all of the videos, and our corresponding mAP scores are greater by a much larger margin. This points to our advantage from temporally leveraging predictions from previous frames during the detection step. We show an example in Fig. 5: in a frame-by-frame comparison between the baseline and our method, we note a higher recall and improved localization. While the last digit is obstructed, its position can be reasonably inferred. In the last two columns of Table 3 we use an object detector to detect hands, whereas the prior two columns (perfect detections) use the manual annotations. Training an object detector on 100 Days of Hands (100DOH) [35], we see a lower localization and tracking accuracy but a consistent trend from the baseline. The quality of the detections serves as a bottleneck, but the margins of improvement are very similar. While our model is trained with perfect detections as priors, they are not required to maintain performance in practice.
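The IoU matching strategy can be sketched as follows. The IoU computation is standard; the greedy assignment and the threshold value are assumptions for illustration (greedy matching is the scheme described for [12]), and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(prev_boxes, curr_boxes, threshold=0.3):
    """Greedily pair boxes across frames, highest IoU first.

    Returns {index in curr_boxes: index in prev_boxes}; unmatched current
    boxes are absent and would start new tracks.
    """
    pairs = sorted(
        ((iou(p, c), i, j)
         for i, p in enumerate(prev_boxes)
         for j, c in enumerate(curr_boxes)),
        reverse=True,
    )
    used_prev, used_curr, matches = set(), set(), {}
    for score, i, j in pairs:
        if score < threshold:
            break  # remaining pairs overlap too little to be the same hand
        if i in used_prev or j in used_curr:
            continue
        matches[j] = i
        used_prev.add(i)
        used_curr.add(j)
    return matches
```

The same greedy loop applies to the other two strategies by replacing `iou` with a similarity derived from the average keypoint distance or from the distance between GCN embeddings.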

Table 4 Ablation analysis using IoU matching strategy (\(\delta =1\))
Table 5 Effect of \(\delta \)
Fig. 6

Optimized for maximum MOTA score, we show the top performing models on PoseTrack18. Consistent with our earlier findings, our model maintains a higher mAP for comparable MOTA scores

Ablation analysis

We perform an ablative analysis on the convolutional map in \(M_\mathrm{att}\) and the fusing module \(M_\mathrm{fus}\). We experiment with no prior convolutional feature map (NC), no attention mechanism (NA), and removal of both (NC-NA), showing our results in Table 4. Our full model has the highest scores overall. The attention mechanism and convolutional feature maps have opposing effects on the mAP and MOTA scores. The NC model does not use a convolutional feature map from frame t, so the fusing module is applied directly to both unaltered heatmaps from \(t-\delta \) and t. We found that this increases the mAP value but lowers the MOTA score. The NA model directly concatenates the convolutional features and the heatmaps, with no attention mechanism. This has the opposite effect, decreasing the mAP significantly but slightly increasing the overall MOTA score. Without contextual convolutional features (NC and NC-NA), the model can still learn to use the prior prediction and improve its detection score. In contrast, removing the attention mechanism leads to a drop in mAP, which may be attributed to an unrefined prior with noisy features. The small increase in the MOTA score is likely due to fewer false positives produced by that model, a consequence of its slightly lower mAP.

We also explore the value of our hyperparameter, \(\delta \), during training. We use values \(\delta =\{1,2,3,4\}\) and show our results in Table 5. Optimizing for highest MOTA score, we found \(\delta =3\) to be best with 39.31, followed by \(\delta =1\) with a smaller MOTA score (39.03) but a higher mAP (58.64 vs 56.66). We find a nonlinear correlation between the mAP and MOTA scores, showing a trade-off in mAP when optimizing for the tracking performance. The best strategy is one that maximizes MOTA accuracy with minimal loss in localization precision.

Evaluation on human pose

We ran additional experiments on the PoseTrack18 dataset, comparing our model with our re-implementation of the baseline. Fig. 6 shows a narrower gap in performance, but our findings are consistent with our earlier experiments. Our model maintains a higher mAP score for the highest MOTA values. Given the trade-off that occurs between mAP and MOTA, this means our model is more likely to retain its localization precision at higher tracking accuracies.


In this work, we introduce Surgical Hands, the first articulated multi-hand pose tracking dataset of its kind. Additionally, we introduce CondPose, a novel network that makes conditional hand pose predictions by incorporating past observations as priors. We show that, compared with a frame-wise independent strategy, our approach performs better at localizing and tracking hand poses and, moreover, achieves a higher localization accuracy for comparable tracking performance. While tracking drives the consistency of joints through time, the actual shape and characteristics of the hand are described by the localization precision. With higher localization precision and better tracking, we obtain a better representation of the hands in the scene. While not the focus of this work, a reliable hand tracking method can provide a salient signal that can be used to approximate surgical skill or understand actions.