Temporally guided articulated hand pose tracking in surgical videos

Purpose: Articulated hand pose tracking is an under-explored problem with potential use in a wide range of applications, especially in the medical domain. With a robust and accurate tracking system for surgical videos, the motion dynamics and movement patterns of the hands can be captured and analyzed for many downstream tasks.

Methods: In this work, we propose a novel hand pose estimation model, CondPose, which improves detection and tracking accuracy by incorporating a pose prior into its prediction. We show improvements over state-of-the-art methods, which provide frame-wise independent predictions, by following a temporally guided approach that effectively leverages past predictions.

Results: We collect Surgical Hands, the first dataset that provides multi-instance articulated hand pose annotations for videos. Our dataset provides over 8.1k annotated hand poses from publicly available surgical videos, along with bounding boxes, pose annotations, and tracking IDs to enable multi-instance tracking. When evaluated on Surgical Hands, our method outperforms the state-of-the-art approach on mean Average Precision, which measures pose estimation accuracy, and on Multiple Object Tracking Accuracy, which assesses pose tracking performance.

Conclusion: In comparison to a frame-wise independent strategy, we show greater performance in detecting and tracking hand poses and a more substantial impact on localization accuracy. This has positive implications for generating more accurate representations of hands in the scene to be used for targeted downstream tasks.


I. INTRODUCTION
Machine learning and computer vision have become increasingly integrated with healthcare in the medical community. This is apparent in the myriad of tasks such as cancer [1] and diabetic retinopathy classification [2], tumor segmentation [3], clinical competence and technical skill assessment [4]-[8], and tool detection and tracking [9]-[12]. In this work we open avenues for new problems introduced by the use of novel automatic articulated hand pose tracking in the surgical domain. Detecting and tracking hand poses can enable us to perform other useful tasks such as technical skill assessment, temporal action recognition, and training surgical residents. Currently these tasks may require tedious manual review from health care professionals, but by capturing and leveraging rich video features we can automate many of these processes in the future.
Articulated pose tracking in the computer vision community is primarily centered around human poses [17]-[25], while in the medical community works have focused on detection and tracking of surgical instruments [9]-[12]. Detecting and tracking surgical instruments can be useful for additional tasks, but the instruments are inherent only to the surgical procedures used during training. We seek to bring articulated hand pose tracking into the surgical domain, while abstracting away the emphasis on surgical instruments. In our view, direct articulated hand tracking will be more widely applicable to various surgical tasks than instrument tracking. With articulated hand pose tracking, we can highlight important properties such as grip, motion, and tension that human experts often attend to when evaluating videos for technical skill, which are agnostic to the surgical instrument or procedure being carried out.

TABLE I: We compare our proposed dataset to other existing hand pose datasets. Our data supports multiple object instances, along with tracking, in each clip.

Name                                Labels             Environment                      # dets   Detection   Multi-instance   Tracking
CMU Manual Hands (MPII+NZSL) [13]   Manual             In-the-wild + Same background    2.8k
CMU Synthetic Hands [13]            Synthetic          Renderer                         14.2k
Panoptic Hands [13]                 Trained Network    Multiview-camera studio          14.8k
LSMV 3D [14]                        Leapmotion sensor  Research office                  184k
Freihand [15]                       Hybrid             Green screen + Indoor/outdoor    36.5k
STB [16]                            Manual             Indoor                           18k
Surgical Hands (Ours)               Manual             Operating rooms                  8.1k
One of the challenges in pose tracking is the temporal consistency of predictions between frames, the lack of which can lead to flickering and infeasible changes in estimated poses. Most of the works [17], [20], [21], [23]-[25] in articulated pose tracking use frame-wise independent predictions. There are works in human pose tracking that use optical flow [18] and deformable convolutions [19] to spatially transform predictions, or 3D convolutions [22] to gather temporal context, but they do not integrate past inferences when producing new predictions.
Therefore we propose Res152-CondPose, a new model that performs predictions conditioned on the pose estimates of prior frames. In Fig. 1, we show a comparison of both approaches: a model that uses frame-wise independent predictions and our model, which uses conditional predictions. The model's pose estimate between frames may fluctuate due to varying factors such as lighting, hand orientation, or motion blur. However, we show that by using prior predictions as guidance, we can improve our localization accuracy. When observing a hand in a video, the internal representation of this object's state (position, appearance, and classification) is a function of its current state and its previous states. Therefore in tracking, it intuitively makes sense to make decisions about an object's state using previous states as well. Our model uses a deep artificial neural network to perform articulated hand pose tracking and a recursive heatmap detection scheme to improve the localization during tracking. By learning this Markovian prior for the prediction of hand joints, we can improve both pose estimation and, consequently, final tracking accuracy.

However, to validate our idea, we lack a major benchmark on articulated hand pose tracking in the surgical domain. Active research for computer vision in the medical field predominantly uses robotic-assisted surgery (RAS) [9]-[12], [26], [27] videos, with detection and tracking annotations that are instrument or procedure specific and naturally cannot extend to novel videos containing unseen instruments. Instead, we annotate the articulated hand poses of surgeons, which subsume both surgical instrument and non-instrument actions, e.g., suturing, knot-tying, and gesturing. We collect a novel dataset named Surgical Hands. Our dataset consists of intraoperative videos of real surgeries featuring hand use in many instrument and non-instrument actions. From publicly available videos we extract video clips and collect articulated pose annotations that label all of the visible hands and hand joints in the scene. We are, to the best of our knowledge, the first to introduce a labeled dataset for both detection and tracking of multiple articulated hand poses. We benchmark our dataset against existing tracking baselines and demonstrate the superiority of our proposed approach on both hand pose estimation and tracking.
Our contributions in this work are as follows:
• We introduce Res152-CondPose, a novel deep network that takes advantage of confident prior predictions to improve localization accuracy and tracking consistency.
• We present Surgical Hands, a new video dataset for multi-instance articulated hand pose estimation and tracking in the in-vivo surgical domain.
• We set new state-of-the-art benchmark performance on Surgical Hands.

II. RELATED WORK

A. Detection and Tracking in the Medical Domain
Data-driven methods in the medical video domain predominantly rely on RAS videos. Unlike our work, RAS videos exclusively contain surgery-specific instruments. Works in this area perform subtasks such as surgical instrument detection, localization, tracking, and skill ranking [9]-[12], [26], [27]. For detection, methods [5]-[7] have traditionally used kinematic data directly, which requires an external apparatus, e.g., the da Vinci Surgical System (dvSS), to capture these measurements. However, full kinematic information is only available for robotic-controlled tools, and even less so for handheld instruments. Adding any external apparatus to capture kinematic data can negatively impact the costs, flexibility, and performance of certain operations.
In contrast to manually measuring kinematics, computer vision-based approaches extract information directly from video data to perform object detection. Khalid et al. [28] and Jin et al. [27] both use a vanilla region proposal network, such as FasterRCNN [29], as their base detection architecture and perform localization tasks on the resulting bounding boxes. Sarikaya et al. [9] modified FasterRCNN to use optical flow frames as an additional branch input to detect the pincers of robotic tools. Laina et al. [30] framed the problem as solving a segmentation and localization task. Ni et al. [11] performed semantic segmentation on different robotic tools by introducing RASNet, a U-shaped network that included a global attention mechanism. From these RAS works, Du et al. [31] is most similar to ours. They construct the articulated pose from the detected points and joints in a bottom-up fashion, using bipartite graph matching. Conversely, we estimate the joints of hands from bounding box detections using a top-down approach and perform tracking from those detections. In a follow-up work [10] the authors use a 3D network architecture and provide more labeled annotations.

Fig. 2: From the current video frame, we crop and center every detection to the input image, $I^n_t$. The baseline generates a heatmap, $\hat{H}'_t$, for each detection using a pose estimation network [18]. In our model, we provide additional information to this heatmap by incorporating a heatmap prior from $t - \delta$. We concatenate the conv1 features at $t$ with $\hat{H}_{t-\delta}$, the predicted output of the same object at $t - \delta$. We pass this through our attention mechanism to produce a weighted heatmap prior, $\hat{H}'_{t-\delta}$, adding the context of conv1 features. Both $\hat{H}'_t$ and $\hat{H}'_{t-\delta}$ are concatenated and passed through the fusing module, using features from both heatmaps to produce the final articulated hand pose estimate.
For tracking, earlier published works have learned a similarity function based on weighted mutual information [32] or a unified detection-tracking framework [33] that utilizes Bayesian filtering and treats detection and surgical instrument tracking as a minimization problem. Later, Nwoye et al. [12] introduced a weakly-supervised approach that uses coarse binary labels to indicate the presence or absence of seven surgical tools. They are the first to use the Multiple Object Tracking Accuracy (MOTA) [34] metric for surgical tool tracking in endoscopic surgery videos. However, their video data contain at most one unique type of tool in each frame; hence, the task can be narrowed down to an object detection problem. Unlike their work, we support multiple tracking instances of the same object in each frame. In our work, we use MOTA as part of our performance metrics for tracking the hands of the surgeons in our videos. Most recently, Zhang et al. [35] presented an object detector trained on bounding box annotations of hands in operating rooms. This contrasts with our work, where we provide articulated poses, integrate past predictions to improve pose estimates, and provide a benchmark for evaluating the quality of tracking.

B. Articulated Pose Estimation and Tracking
1) Human Pose: Articulated pose estimation and tracking is commonly applied to images and videos of people. Methods for this task are grouped into top-down [18]-[22] and bottom-up [23]-[25]. Top-down methods first detect and crop all persons from an image using an object detector. Then for each crop the human pose is regressed independently using a pose estimation network. Bottom-up methods localize all joints throughout the entire image, and then use bipartite matching and graph minimization techniques to assign joints to each person. As top-down approaches typically perform best in practice, we follow this paradigm.
In our work we borrow our pose estimation model from [18], which uses a ResNet architecture followed by deconvolutional layers to produce localization heatmaps of human joints. For tracking, their method uses greedy matching and selects the highest intersection-over-union (IoU) overlap from detected bounding boxes between adjacent frames. To incorporate some temporal consistency, they include bounding boxes of poses propagated from previous frames using computed optical flow. However, this approach cannot improve predicted poses; it can only spatially shift them using the estimated optical flow.
Bertasius et al. [19] perform another type of pose propagation by using deformable convolutions to apply a learned transform that warps the predicted pose from the first frame onto a second frame. Their method is similar to ours in that it uses the output heatmap from both frames to generate a new output, but pose warping does not overcome missed or erroneous pose predictions between frames. Our method can overcome these problems because it integrates the newly inferred pose at each frame before producing the new heatmap output.
Another method for tracking by Ning et al. [21] utilizes a Graph Convolutional Network (GCN) [36] to generate embeddings from detected poses to match across video frames. This introduces robustness because, in spite of large camera movements, the human pose remains stable for small time steps. Ideally, GCN tracking would succeed in cases of considerable and shaky camera movements where IoU tracking may fail due to large changes in on-screen position. One drawback is that for a relatively stable camera position and multiple similar poses, e.g., an on-stage dance routine, the embedded features become very similar and increasingly difficult to separate. Hence we introduce a visual feature embedding, extracted from each image crop, in addition to the embedded pose of each hand.
2) Hand Pose: Existing work on 2D hand pose estimation [13], [15], [37] is analogous to human pose estimation. However, we are not aware of existing deep learning-based methods evaluated on hand pose tracking. Simon et al. [13] provide a method that helps a network improve its detection of occluded joints by training on multiple views of the same scene. Santavas et al. [37] introduced a self-attention module to their estimation network, and Zimmermann et al. [15] introduced a dataset for the task of 3D pose estimation from RGB images. We introduce a new video dataset for multi-instance articulated hand pose tracking in the in-vivo surgical environment. As shown in Table I, many existing datasets support detection with no temporal coherence between video frames. The only dataset that claims tracking is STB [16]; however, their dataset consists of only a single hand throughout their video sequences and at most one detection per frame. Our data includes varying lighting conditions, fast movement, and diversity in scene appearances. Distinctively, we also include gloved hands, which appear in contrasting colors such as latex and green. To our knowledge, we are the first to provide annotated data and a strong baseline to compare against for the articulated hand pose tracking task.

III. METHOD
In the articulated pose tracking domain, many prior works have used models that only generate frame-wise independent pose predictions. Even optical flow [18] or deformable convolution [19] techniques do not account for confidences of past observations, which play an important role for temporally informed predictions. To overcome these limitations, we propose Res152-CondPose, which performs articulated pose detection and tracking by incorporating previous observations as prior guidance.
We show our model in Fig. 2. While the baseline simply produces a heatmap from each hand using a pose estimation network, we incorporate previous predictions into our final heatmap output. By leveraging past predictions from our model, we can produce conditioned hand pose outputs, improving detection performance during inference. Res152-CondPose is designed for video data, so we begin by pretraining our pose estimation network on an image dataset [13]. Then, we finetune on our video dataset, Surgical Hands, training our model to make conditioned hand pose predictions. Last, we introduce and experiment with a learned joint visual-pose embedding used within a graph convolutional network for tracking hand poses.

A. Hand Pose Estimation in Images
First, we begin by pretraining our pose estimation model on an image dataset [13]. Our base network is borrowed from Xiao et al. [18] and uses a ResNet backbone architecture. This is an encoder-decoder style network that upsamples feature maps into a final heatmap prediction, $\hat{H}$. The details of this architecture are encapsulated into the pose estimation network shown in Fig. 2.
We define the basic input and output for the pose estimation network, $P$, as $\hat{H} = P(I)$. The input is an image crop $I \in \mathbb{R}^{H \times W \times 3}$, and the output is a predicted heatmap $\hat{H} \in \mathbb{R}^{H' \times W' \times J}$. Here $H$ and $W$ are the input image height and width, and $H'$ and $W'$ are the output heatmap height and width. $J$ is the number of predicted joints of each hand. Each image crop is extracted from the object's bounding box, and the size is empirically chosen to be 2.2 times the total area of the object's bounding box.
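As a small illustration of the crop sizing above, the following sketch enlarges a bounding box so that its area grows by a factor of 2.2. The text only states the area factor; keeping the box centered and preserving its aspect ratio are assumptions made here, and `expand_box` is a hypothetical helper name.

```python
import math

def expand_box(x, y, w, h, area_scale=2.2):
    """Return (x, y, w, h) of a box with the same center and area_scale times the area."""
    # Assumption: both sides are scaled equally, so each side grows by sqrt(area_scale).
    side_scale = math.sqrt(area_scale)
    new_w, new_h = w * side_scale, h * side_scale
    cx, cy = x + w / 2.0, y + h / 2.0  # center of the original box
    return (cx - new_w / 2.0, cy - new_h / 2.0, new_w, new_h)

# Example: a 40x20 hand box whose top-left corner is at (100, 50).
crop = expand_box(100.0, 50.0, 40.0, 20.0)
```

In practice the enlarged box would also need to be clipped or padded at the image border, which this sketch leaves out.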
We train our network using the mean squared error (MSE) loss between the ground truth heatmap, $H$, and the predicted heatmap, $\hat{H}$, shown as

$$\mathcal{L}_{MSE} = \frac{1}{J} \sum_{j=1}^{J} M_j \, \lVert H_j - \hat{H}_j \rVert_2^2. \tag{1}$$

The ground truth heatmaps, $H$, are generated from 2D Gaussians centered on each annotated keypoint. Not all joints are visible and annotated, so a binary mask, $M$, is included to mask out the un-annotated joints, which therefore do not factor into the loss calculation. During evaluation, the final joint location is taken from the position of the maximum value for each joint prediction in $\hat{H}$. After completing the pretraining, we finetune our model on videos to learn conditional hand pose predictions.
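The three steps described above (Gaussian ground truth heatmaps, a masked MSE loss, and peak decoding at evaluation time) can be sketched in plain Python. This is a minimal sketch rather than the authors' implementation: the Gaussian width `sigma`, the per-pixel averaging used to normalize the loss, and all function names are assumptions.

```python
import math

def gaussian_heatmap(cx, cy, height, width, sigma=1.0):
    """2D Gaussian with peak 1.0 centered on (cx, cy); rows index y, columns index x."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)] for y in range(height)]

def masked_mse(pred, target, mask):
    """Mean squared error over the pixels of joints whose mask entry is 1."""
    total, count = 0.0, 0
    for p_joint, t_joint, m in zip(pred, target, mask):
        if not m:
            continue  # un-annotated joint: it does not factor into the loss
        for p_row, t_row in zip(p_joint, t_joint):
            for p, t in zip(p_row, t_row):
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0

def decode_joint(heatmap):
    """Return (x, y, confidence) at the maximum of a single-joint heatmap."""
    best = (0, 0, heatmap[0][0])
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best[2]:
                best = (x, y, value)
    return best

# Two joints on an 8x8 heatmap; the second joint is treated as un-annotated.
target = [gaussian_heatmap(3, 2, 8, 8), gaussian_heatmap(5, 6, 8, 8)]
pred = [target[0], [[0.0] * 8 for _ in range(8)]]
loss_masked = masked_mse(pred, target, [1, 0])
loss_unmasked = masked_mse(pred, target, [1, 1])
```

With the mask applied, the missing second joint contributes nothing, so `loss_masked` is zero while `loss_unmasked` is not.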

B. Hand Pose Estimation in Videos
While the image dataset from the previous section cannot be used to learn our conditional hand pose predictions, we can still use it to learn weights that speed up our training process and improve generalizability. We finetune our full network, Res152-CondPose, on our video dataset, Surgical Hands, as shown in the top portion of Fig. 2. To incorporate a prior branch, we introduce a heatmap prior, $\hat{H}_{t-\delta}$, a pose estimate of the same object from $t - \delta$. Our model performs conditional predictions, which we define as

$$\hat{H}_t = M_{fus}\big(P(I_t);\, M_{att}(v_t;\, \hat{H}_{t-\delta})\big). \tag{2}$$

For each input frame, $I_t$, we output the localization heatmap $\hat{H}_t$. In contrast to our previous definition of $P$, this output is conditioned on the network's prediction at a previous time step $t - \delta$. Our model is further composed of two branches: the attention mechanism, $M_{att}$, and the fusing module, $M_{fus}$. $P$ is the base pose estimation network. The attention mechanism, $M_{att}$, contextualizes the prior heatmap prediction, $\hat{H}_{t-\delta}$, with the current image visual features, $v_t$. The purpose of this branch is to relate the visual representation of a hand and its localized heatmap prior, ideally learning to weight each joint prior accordingly. We use the conv1 feature output from the base network to represent the visual features from the image, and concatenate this with the aligned previous heatmap prediction. $M_{att}$ is composed of two convolutional layers, followed by a transposed convolution, with ReLU non-linearities in between.
Our fusing module, $M_{fus}$, has an identical architecture to $M_{att}$. However, it is trained to produce the final heatmap output from the concatenation of our initial prediction, $\hat{H}'_t$, and our weighted heatmap prior, $\hat{H}'_{t-\delta}$. This branch produces a merged final heatmap, $\hat{H}_t$, using the Gaussian peak magnitudes and locations from both intermediate heatmaps. For example, if the prediction at $t_1$ correctly localizes a joint but the prediction at $t_2$ misses that same joint, we expect the final output to be adjusted based on the confidences of each detection.
During training, the prior is selected from frame $t - \delta$. If the object does not exist at that frame, we use earlier frames up until the first. If a corresponding object does not exist in any previous frame, then the prior, $\hat{H}_{t-\delta}$, is set to an all-zeros heatmap. An all-zeros prior is also expected during evaluation, because priors do not yet exist at the start of a video. Also during evaluation, unlike training, the prior associated with the current detected hand is unknown. We are given $n$ priors from time $t - 1$, $\{\hat{H}^1_{t-1}, \hat{H}^2_{t-1}, \ldots, \hat{H}^n_{t-1}\}$, and $k$ detections (image crops) at time $t$, $\{\hat{I}^1_t, \hat{I}^2_t, \ldots, \hat{I}^k_t\}$. All prior and detection pairs are passed through the network to get the predicted heatmaps, but only the heatmap with the highest average confidence score (Gaussian peak) is selected as the output for that detection.
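The evaluation-time selection described above can be sketched as an exhaustive search over prior and detection pairs. The sketch below is illustrative only: `predict` is a hypothetical stand-in for the conditioned network, and the toy inputs are scalars rather than image crops and heatmaps.

```python
def peak_confidence(heatmaps):
    """Average over joints of the maximum value (Gaussian peak) in each joint heatmap."""
    peaks = [max(max(row) for row in joint) for joint in heatmaps]
    return sum(peaks) / len(peaks)

def select_outputs(detections, priors, predict):
    """For every detection, try every prior and keep the most confident output.

    `predict(crop, prior)` is a hypothetical callable returning per-joint heatmaps.
    When no priors exist yet, the caller is assumed to pass one all-zeros prior.
    Returns one (prior_index, heatmaps) tuple per detection.
    """
    results = []
    for crop in detections:
        best_index, best_maps, best_score = None, None, float("-inf")
        for index, prior in enumerate(priors):
            heatmaps = predict(crop, prior)
            score = peak_confidence(heatmaps)
            if score > best_score:
                best_index, best_maps, best_score = index, heatmaps, score
        results.append((best_index, best_maps))
    return results

# Toy stand-in: crops and priors are scalars, and confidence is higher when they agree.
def toy_predict(crop, prior):
    return [[[max(0.0, 1.0 - abs(crop - prior))]]]  # one joint, 1x1 heatmap

selected = select_outputs([0.2, 0.9], [1.0, 0.0], toy_predict)
chosen_priors = [index for index, _ in selected]
```

Note that this costs n times k forward passes per frame, which stays small for the handful of hands visible in a typical clip.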

C. Matching Strategies for Tracking
After estimating the pose of the hands from each detection, we require a matching strategy to perform tracking. Given $n$ hands at time $t - 1$, {o [...] Here, we experiment with a modification to the graph pose embeddings that includes a visual representation of the object in the encoding.
Using IoU to match detections between time steps and perform tracking is typically the highest performing option. Operating strictly on bounding boxes of the entire hand, this strategy has the advantage of being most robust to pose prediction errors. However, the most common sources of failure for IoU are cases of multiple overlapping objects or substantial movement. Ning et al. [21] generate a graph pose embedding from the keypoint positions of each detected person and use it to assign human poses between frames. In addition, we try concatenating the embedded pose with the conv1 output from the base network to encode visual information into this embedding as well. We hypothesize that this visual-pose feature embedding can disambiguate between hands with similar pose configurations.
Similar to Ning et al. [21], we train a GCN to output the embedding of each input hand pose, $\mathcal{X}$, defined simply as $p = GCN(\mathcal{X})$. Here $\mathcal{X} \in \mathbb{R}^{J \times C}$, where $J$ is the number of joints and $C$ is the number of channels. We follow this with a joint-embedding layer, $\mathcal{J}$, that outputs a visual-pose embedding from the concatenation of the visual features, $v$, and the pose embedding, $p$, defined as $p_v = \mathcal{J}(v; p)$. Our joint-embedding layer, $\mathcal{J}$, consists of two fully-connected layers separated by a ReLU function. For training, we use the contrastive loss [38], defined as

$$\mathcal{L}(p^1_v, p^2_v, y) = \tfrac{1}{2}\, y\, d^2 + \tfrac{1}{2}\, (1 - y)\, \max(0,\, m - d)^2.$$

The contrastive loss is used to place embeddings close in perceptual distance, meaning that visually similar pairs will be close together and dissimilar pairs will be further apart. For a pair of embeddings $p^1_v$ and $p^2_v$, the variable $d$ represents the L2 distance between the two, where $d = \lVert p^1_v - p^2_v \rVert_2$. $y$ is a binary label that indicates whether the two embeddings represent the same hand, 1, or different hands, 0. $m$ is the margin, a hyperparameter used for tuning.
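To make the behavior of the contrastive loss concrete, here is a minimal sketch using the standard formulation with a factor of one half on each term. The exact scaling used by the authors is an assumption, as are the function names and the margin values in the example.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb1, emb2, y, margin=1.0):
    """y = 1: same hand (pull together); y = 0: different hands (push past the margin)."""
    d = l2_distance(emb1, emb2)
    pull = y * d ** 2                             # penalizes distance for matching pairs
    push = (1 - y) * max(0.0, margin - d) ** 2    # penalizes closeness for non-matching pairs
    return 0.5 * (pull + push)

same_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], y=1)                    # d = 5
different_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0)               # beyond the margin
different_near = contrastive_loss([0.0, 0.0], [0.6, 0.8], y=0, margin=2.0)  # d = 1
```

A matching pair is penalized in proportion to its squared distance, while a non-matching pair is penalized only when it falls inside the margin.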
For each item in our minibatch, positive pairs are selected between adjacent frames with probability p = 0.5, and negative pairs are selected from the same video with p = 0.4 or from a different video with p = 0.1. We evaluate our trained GCN models using the classification accuracy between pairs of selected hands. We compare both the pose embedding and the visual-pose embedding in Sec. V. From our experiments, the GCN and GCN-joint visual models achieve classification accuracies of > 97% using features across the models.

IV. DATASET
To benchmark our method, we require data to train and evaluate our models and baselines. To that end, we introduce Surgical Hands, a novel video dataset for multi-instance articulated hand pose estimation and tracking in the surgical domain. We are the first to publish a labeled dataset for both detection and articulated hand pose tracking of multiple hands in videos. From publicly available videos, we selected 28 that had a view of the patient cavity and the hands of the surgical team members during the operation. We then extracted 76 clips sampled at 8 frames per second and provided bounding box, class label, tracking ID, and pose annotations for all hands in the scenes. We show samples of our annotations in Fig. 3.

A. Hand Pose Annotations
We used Amazon Mechanical Turk (AMT) to collect hand pose annotations and a modified version of Visipedia Annotation Tools^1 to generate bounding box annotations, class labels, and joint positions. Each hand is labeled as left or right, and consists of 21 joint annotations in total, consistent with prior works [13]. A canonical skeleton from our annotations is shown in Fig. 4. Annotators are tasked with drawing bounding boxes around each visible hand in the frame, even partially occluded hands, and labeling the joints on each hand. Each joint is labeled with a keypoint that takes one of three states: visible, occluded, or not-available. Visible implies that the joint is visibly on screen, occluded means the joint is obstructed but its position can be estimated, and not-available means the joint position cannot be inferred or it is off-screen. In addition to the bounding boxes and the labeled keypoints, unique tracking IDs were manually assigned for all hands in the video clips.

B. Dataset Statistics
From our collected data, we have 2,838 annotated frames from the 76 total clips, 8,178 unique hand annotations, and a total of 21 unique annotators. Each annotated frame contains a mean of 2.88 hands, a median of 3 hands, and a maximum of 7 hands. Fig. 5 shows the joint annotation visibility across our labeled frames. We see that across all instances, the joints from the ring and pinky fingers show the highest rate of being visually obscured or not-available. In the majority of our video clips, the position and orientation of the hand make it extremely difficult to localize those joints. This is expected because the 4th and 5th digits are underutilized in many procedures; the first 3 digits are typically used to hold surgical instruments.

V. EXPERIMENTS

A. Implementation Details
We begin by training the pose estimation model on an image dataset for hand pose estimation; we use the CMU Manual Hands + Synthetic Hands (Mixed Hands) [13] image dataset. We start with an ImageNet pretrained ResNet-152 network architecture and train for 30 epochs using a batch size of 16. We use an Adam optimizer and an initial learning rate [...]

1 https://github.com/visipedia/annotation_tools

Afterwards, we train our GCN and joint-embedding layer for 60 epochs using a batch size of 32 and an initial learning rate of 1e-3. Here our linear scheduler decays the learning rate at epochs 20 and 30 by factors of 10. To preprocess the input keypoints, $\mathcal{X}$, we first normalize them to be between 0 and 1 relative to their position along the bounding box (i.e., subtract the top-left (x, y) from each joint coordinate and divide by (width, height), respectively). The input to the GCN follows the same convention as prior work: the dimension of each input is $J \times C$, where $J$ represents the number of joints and $C$ represents the number of channels. In our experiments, C = 2 for x-y coordinates only and C = 3 to include the annotation state of each keypoint (0 = unannotated or 1 = annotated). With C = 2, unannotated keypoints are given a default value of −1.
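The keypoint preprocessing for the GCN input can be sketched as follows. The handling of unannotated keypoints with C = 2 (a default of −1) follows the text; filling their coordinates with 0 when C = 3 is an assumption, as is the helper name `normalize_keypoints`.

```python
def normalize_keypoints(keypoints, box, channels=2):
    """Build a J x C input from pixel keypoints.

    keypoints: list of (x, y) tuples, or None for an unannotated joint.
    box: (x, y, width, height) with (x, y) the top-left corner.
    """
    bx, by, bw, bh = box
    rows = []
    for kp in keypoints:
        if kp is None:
            # C = 2: default value of -1; C = 3: state 0 (coordinates of 0 are an assumption).
            rows.append([-1.0, -1.0] if channels == 2 else [0.0, 0.0, 0.0])
            continue
        nx, ny = (kp[0] - bx) / bw, (kp[1] - by) / bh  # position relative to the box, in [0, 1]
        rows.append([nx, ny] if channels == 2 else [nx, ny, 1.0])
    return rows

# Two joints (one unannotated) inside a 40x40 box whose top-left corner is at (100, 50).
xy_only = normalize_keypoints([(110, 70), None], (100, 50, 40, 40), channels=2)
with_state = normalize_keypoints([(110, 70), None], (100, 50, 40, 40), channels=3)
```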

B. Detection Performance
To evaluate detection performance on our Surgical Hands dataset, we use the mean Average Precision (mAP) metric. mAP is also the detection metric selected for PoseTrack [17] in the human pose tracking domain, so we adopt it as well. The mAP is computed using the Probability of Correct Keypoints (PCK) metric. The PCK metric measures the probability of correctly localizing keypoints within a given normalized threshold distance, $\sigma$, shown as:

$$PCK_{\sigma} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left( \frac{\lVert \hat{x}_i - x_i \rVert_2}{d} \le \sigma \right)$$

where $\hat{x}_i$ and $x_i$ are the predicted and ground truth locations of keypoint $i$, $N$ is the number of evaluated keypoints, and $d$ is the normalizing distance. We modify the PoseTrack off-line evaluation code to be compatible with our hand pose data, and the distance is normalized to 0.2 times the bounding box size. The value 0.2 was empirically chosen to be roughly the ratio between the length of a thumb joint and the enclosing bounding box. $\sigma$ is unchanged from the existing code and remains 0.5. Pose predictions are assigned to ground truth poses based on the highest PCK, and unassigned predictions are counted as false positives. AP for each joint is computed and mAP is reported across the entire dataset.

In all of our experiments we use leave-one-out cross validation, with 28 cross validation folds in total. Each fold contains video clips from the same source video. The presented results are the average performance across all 28 validation folds. In Table II we show the mAP at the highest Multiple Object Tracking Accuracy (MOTA) score (defined in the next section) for each model. We see that with our recursive heatmap strategy we are able to obtain higher average precision across the different joints in the hand.

In Fig. 7 we show qualitative examples of our hand pose estimation on various frames from our Surgical Hands dataset. The top row clips are sampled from the best performing clips, while the bottom row clips are from the worst performing clips based on MOTA score. We see that the model suffers most in cases of heavy occlusion, where the camera view excludes the majority of the hand. Ambiguity in the position of the hand furthers the localization errors, e.g., a top-down view with most fingers occluded. The best performing cases are those with balanced lighting and an unambiguous view of the first few digits.
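The PCK check described earlier in this section can be sketched for a single hand as follows. This is not the PoseTrack evaluation code: treating the bounding box size as a single scalar supplied by the caller, skipping unannotated ground truth joints, and counting missing predictions as misses are all assumptions of this sketch.

```python
import math

def pck(pred, gt, box_size, norm_factor=0.2, sigma=0.5):
    """Fraction of annotated keypoints whose normalized error is at most sigma.

    pred, gt: lists of (x, y) tuples; None marks a missing prediction or annotation.
    box_size: scalar size of the enclosing bounding box (its definition is an assumption).
    """
    norm = norm_factor * box_size
    hits, total = 0, 0
    for p, g in zip(pred, gt):
        if g is None:
            continue  # no ground truth for this joint, so it is not evaluated
        total += 1
        if p is not None and math.dist(p, g) / norm <= sigma:
            hits += 1
    return hits / total if total else 0.0

# With box_size = 100 the normalizer is 20 px, so a joint counts as correct within 10 px.
pred = [(10.0, 10.0), (50.0, 50.0), None]
gt = [(13.0, 14.0), (50.0, 65.0), (0.0, 0.0)]
score = pck(pred, gt, box_size=100.0)
```

Here the first joint is 5 px off (a hit), the second is 15 px off (a miss), and the third has no prediction, so one of three joints is counted as correct.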

C. Tracking Performance
In the previous section, we measured detection, which emphasizes only localization accuracy across all frames. Tracking performance also takes into account the consistency of the localized keypoints across the video. To measure tracking performance, we use Multiple Object Tracking Accuracy (MOTA) scores. The MOTA metric is part of the CLEAR MOT metrics [34] and is also included in the PoseTrack metrics. It is defined as:

$$MOTA = 100 \cdot \left( 1 - \frac{FN + FP + IDSW}{G} \right)$$

This encapsulates many errors that may occur during multiple object tracking: false negatives ($FN$), false positives ($FP$), and identity switches ($IDSW$). $FN$ are joints for which no hypothesis/prediction was given, $FP$ are all the hypotheses for which no real joint exists, and $IDSW$ are all occurrences in which the tracking IDs for two joints are swapped. $G$ represents the total number of ground truth joints. The range of values for the MOTA score is (−∞, 100].
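A short sketch shows how the three error counts combine into a MOTA score and why the score has no lower bound. Expressing the result as a percentage follows the stated range of the metric; the function name and the example counts are illustrative.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """MOTA as a percentage, from error counts accumulated over all frames."""
    if num_ground_truth <= 0:
        raise ValueError("MOTA is undefined without ground truth joints")
    errors = false_negatives + false_positives + id_switches
    return 100.0 * (1.0 - errors / num_ground_truth)

good_tracker = mota(10, 5, 5, 100)    # 20 errors against 100 ground truth joints
noisy_tracker = mota(60, 60, 0, 100)  # more errors than ground truth joints
```

Because false positives are not bounded by the number of ground truth joints, a tracker that hallucinates many joints can score below zero, as `noisy_tracker` does.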
The four methods we experiment with are: IoU overlap, L2 distance, GCN, and GCN-joint visual. IoU overlap, also known as spatial consistency, is the most common method. It involves measuring the overlap between the bounding boxes at the previous frame and the current frame. The detection with the highest overlap is selected as the same object. L2 distance is measured as the average L2 distance of the regressed keypoints between detections at time $t - 1$ and $t$. GCN measures the embedding similarity between the encoded keypoints at different time steps; the embeddings closest in distance are selected as matches. We also experiment with GCN-joint visual, which likewise measures embedding distance, except that we use our visual-pose embeddings from each object.
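As an illustration of the IoU overlap strategy, the sketch below greedily pairs boxes from the previous and current frames in order of decreasing overlap. The `min_iou` threshold and the corner-based box format are assumptions made for this example and are not taken from the paper.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_match(prev_boxes, curr_boxes, min_iou=0.3):
    """Return {current_index: previous_index}, taking pairs in order of decreasing IoU."""
    pairs = sorted(
        ((iou(p, c), pi, ci)
         for pi, p in enumerate(prev_boxes)
         for ci, c in enumerate(curr_boxes)),
        reverse=True,
    )
    matches, used_prev = {}, set()
    for score, pi, ci in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same hand
        if pi in used_prev or ci in matches:
            continue  # each box can be matched at most once
        matches[ci] = pi
        used_prev.add(pi)
    return matches

prev_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
curr_boxes = [(21, 21, 31, 31), (1, 0, 11, 10), (50, 50, 60, 60)]
matches = greedy_match(prev_boxes, curr_boxes)
```

The third current box overlaps nothing from the previous frame, so it stays unmatched and would start a new track.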
We show quantitative results from our experiments in Table IV and the per-joint performance in Table III. Each row is maximized for the highest MOTA score across all hyperparameters, shown along with its corresponding mAP. We found GCN-joint visual to perform slightly better than GCN on our model, but still lower than IoU and L2. We do not include it as a comparison in our final table but leave it in the Appendix. Our method has a marginally higher MOTA score across all of the videos, but our corresponding mAP scores are greater by a much larger margin. This points to our advantage from temporally leveraging predictions from previous frames during the detection step. We show an example from a clip in Fig. 6. In a frame-by-frame comparison between the baseline and our method, we note a higher recall and improved localization. In this case the last digit is obstructed, but its position can be reasonably inferred. From our annotations, this is labeled as occluded. Both methods make some prediction about this digit initially, but our model can use past predictions to reinforce its confidence, while the per-frame prediction may be noisy.

Fig. 6: We show a qualitative comparison between the baseline model and our method. We note a higher recall and consistency between frames, as shown for the hand to the left. Even when the pinky finger is not visible, the past predictions help reinforce those joint locations.
In top-down methods, the performance of tracking is heavily dependent on the accuracy of bounding box detections. Our earlier experiments evaluated the cross-validation splits independent of an object detector, using the ground truth annotations for the image crops. Here we use hand object detections as input to the tracking pipeline. We use a Faster-RCNN detection architecture trained on the 100 Days of Hands (100DOH) dataset provided by Shan et al. [39]. The results are shown in the last column of Table IV. As expected, the localization and tracking accuracy are lower than when using the manually annotated bounding boxes as input. However, the trends from both experiments remain consistent. The quality of the detected bounding boxes serves as a bottleneck to the performance of the entire system. We sample clips from the best and worst performing folds in Fig. 7.

D. Ablation Analysis
To evaluate the efficacy of each part of our model, we perform an ablative analysis on the convolutional feature map and the attention mechanism. We experiment with three different modifications to our network: no prior convolutional feature map, no attention mechanism, and removal of both. We show the results of these experiments in Table V.
As expected, the full model has the highest scores overall, but the model variants are less straightforward. The addition of the attention mechanism and convolutional feature map seem to have opposing effects on the mAP and MOTA scores, respectively. The NC model does not use a convolutional feature map from frame t to gather context, so the fusing module is applied directly to both the unmodified heatmaps from t − δ and t. We found that this increases the mAP value, but lowers the MOTA score. For the NA model, we directly concatenate the convolutional features and the heatmaps, with no attention mechanism. We see that this has the opposite effect, decreasing the mAP significantly but slightly increasing the overall MOTA score. When directly using the prior without the convolutional features for context (NC and NC-NA models), we hypothesize that the model can still learn to use the prior prediction and improve its detection (mAP) score. This holds even more so with the attention mechanism, where it can still weigh certain joints that are more likely to improve pose prediction. Here the NA model brings a drop in mAP, which may be attributed to an unrefined prior with noisy features. Strangely, the NA model shows a small increase in the MOTA score. The MOTA metric counts tracking errors produced by the model, which in our experiments were shown to be heavily weighted by false positives. Given the lower mAP score by the NA model and the smaller margin in the MOTA metric, it is likely that the small improvement comes from fewer false positives produced by that model.
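To make the role of false positives in this argument concrete, the sketch below computes MOTA using the standard CLEAR MOT definition, MOTA = 1 − (FN + FP + IDSW) / GT. The function name and the counts in the usage note are hypothetical and are not results from our experiments.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy under the standard CLEAR MOT definition.

    All three error types are summed over the sequence and normalized by the
    total number of ground-truth objects, so the score can become negative
    when the errors outnumber the ground-truth objects.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt
```

With hypothetical counts, `mota(10, 5, 1, 100)` gives 0.84, while raising only the false positives, `mota(10, 40, 1, 100)`, gives 0.49. Because each error type contributes equally per instance, a model that emits fewer false positives can score a higher MOTA even when its localization (mAP) is worse.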
We also explored the value of our hyperparameter, δ, during training. The prior is selected from frame t − δ, so a higher δ lends itself to a longer temporal range. We train with values δ = {1, 2, 3, 4} and show our results in Table VI. In our experiments, we are optimizing for the highest MOTA score and we found that to be δ = 3 with 39.31. The second highest is δ = 1 with a slightly smaller MOTA score (39.03), but a much higher mAP (58.64 vs 56.66). It is important to note that in our experiments hyperparameters are tuned to maximize the MOTA performance only. We also find a nonlinear correlation between the mAP and MOTA scores; there is often a trade-off in mAP when optimizing for the tracking performance. Therefore the best strategy is one that maximizes MOTA accuracy with minimal loss in localization precision.
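One way to read the selection strategy above is as a two-stage rule: keep every configuration whose MOTA is within a small tolerance of the best, then prefer the one with the highest mAP. The sketch below illustrates this under that assumption. The `mota_tolerance` parameter and the function name are hypothetical, and only the two δ settings whose scores are quoted in the text are included.

```python
def select_config(results, mota_tolerance=0.0):
    """Pick the highest-mAP configuration among those close to the best MOTA.

    `results` is a list of dicts with keys "delta", "mota", and "map".
    With mota_tolerance=0.0 this reduces to choosing the maximum MOTA.
    """
    best_mota = max(r["mota"] for r in results)
    candidates = [r for r in results if best_mota - r["mota"] <= mota_tolerance]
    # Break ties on mAP by falling back to the MOTA score.
    return max(candidates, key=lambda r: (r["map"], r["mota"]))


# Scores quoted in the text for the two best settings of delta.
results = [
    {"delta": 1, "mota": 39.03, "map": 58.64},
    {"delta": 3, "mota": 39.31, "map": 56.66},
]
```

Here `select_config(results)` returns the δ = 3 entry, which matches tuning for MOTA only, whereas `select_config(results, mota_tolerance=0.5)` returns the δ = 1 entry, trading 0.28 MOTA for 1.98 mAP.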

E. Evaluation on Human Pose
We ran additional experiments on the PoseTrack18 dataset to compare our model with the baseline. The baseline model, in the authors' work, also includes optical flow during tracking evaluation. The optical flow is used only to shift the detected bounding boxes; it is not incorporated into the pose estimation model itself. Hence we use our re-implementation and add the same tracking to both. We repeat this for our dataset in the Appendix.
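The sketch below illustrates the general idea of using optical flow only to shift a bounding box before matching. It assumes a dense flow field of shape (H, W, 2) holding per-pixel (dx, dy) displacements and shifts the box by the mean flow inside it. Both the averaging rule and the function name are assumptions for illustration and may differ from the baseline's implementation and from our re-implementation.

```python
import numpy as np


def shift_box_with_flow(box, flow):
    """Translate a box (x1, y1, x2, y2) by the mean optical flow inside it.

    `flow` is a dense field of shape (H, W, 2) storing (dx, dy) per pixel.
    The box is returned unchanged if it does not overlap the image.
    """
    height, width = flow.shape[:2]
    x1, y1, x2, y2 = (float(v) for v in box)
    # Clip the box to the image and convert to integer pixel indices.
    c0, c1 = int(max(0, np.floor(x1))), int(min(width, np.ceil(x2)))
    r0, r1 = int(max(0, np.floor(y1))), int(min(height, np.ceil(y2)))
    if c1 <= c0 or r1 <= r0:
        return (x1, y1, x2, y2)
    dx, dy = flow[r0:r1, c0:c1].reshape(-1, 2).mean(axis=0)
    return (x1 + float(dx), y1 + float(dy), x2 + float(dx), y2 + float(dy))
```

The shifted box from frame t − 1 can then be compared with detections at frame t using any of the similarity measures, which leaves the pose estimation model itself untouched.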
In Fig. 8, we show a narrowed gap in performance, but our findings are consistent with our earlier experiments. We perform a grid search on hyperparameters and optimize for maximum MOTA. For the top scoring models, we have similar or higher mAP at equivalent MOTA values. Given the trade-off that occurs between mAP and MOTA, this means our model is more likely to retain its localization precision at higher tracking accuracies.

VI. DISCUSSION
In this work, we have introduced Surgical Hands, the first articulated multi-hand pose tracking dataset to be used in evaluating hand tracking methods. We collected data specifically to be used in the medical domain from publicly available videos of real surgical procedures. Additionally, we introduced Res152-CondPose, a novel network that makes conditional hand pose predictions by incorporating past observations as priors. We show that when compared with a frame-wise independent strategy, we have better performance in detecting and tracking hand poses. Moreover, we see that our model has a much greater impact on the localization accuracy.

Fig. 8: Optimized for maximum MOTA score, we show the top performing models on PoseTrack18. Consistent with our earlier findings, our model has a higher mAP for comparable MOTA scores. We use our re-implementation of the optical flow track to include in the matching strategy.
When capturing the hand pose in videos, tracking drives the consistency of the object and joints through time, but the actual shape and characteristics of the hand are described by the localization precision. With a higher localization precision and better tracking still, we obtain a more accurate representation of the hands in the scene.
While not the focus of this work, tracking hands can lead to other useful applications in the surgical domain such as technical skill assessment or temporal action detection.
With a reliable hand tracking method, we can provide a salient signal that can be used to approximate skill or understand actions.

Fig. 1: We show a real example from Surgical Hands. On the left, a method only performing frame-wise independent predictions may miss out on properly localizing joints, while on the right, temporally passing past predictions from previous frames improves the network's localization.

Fig. 3: Samples from crowd-sourced annotations. For each hand we provide a bounding box, a left/right hand class label, tracking ID, and keypoint annotations for visible and occluded joints.

Fig. 7: (Best viewed in color.) These qualitative examples are generated using bounding boxes from an object detector. We sample frames from the best performing folds (top row) and lower performing folds (bottom row). Each digit is colored differently to assess the prediction quality and accuracy.
Given n hands at time t − 1, $\{o^1_{t-1}, o^2_{t-1}, \ldots, o^n_{t-1}\}$, and m hands at time t, $\{o^1_t, o^2_t, \ldots, o^m_t\}$, we first use a similarity function to derive similarity measures between each pair at t − 1 and t. Common methods are intersection-over-union (IoU) of the bounding boxes, average L2-distance of the predicted joint locations, or L2-distance between the graph pose embeddings.
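Once the pairwise similarities are available, one simple way to turn them into track assignments is a greedy one-to-one match in descending order of similarity. The sketch below assumes an n × m similarity matrix (rows for hands at t − 1, columns for hands at t). The greedy rule, the `min_similarity` threshold, and the function name are illustrative assumptions and not necessarily the assignment procedure used in our pipeline.

```python
import numpy as np


def greedy_match(similarity, min_similarity=0.0):
    """Greedily pair previous-frame and current-frame hands by similarity.

    `similarity[i, j]` scores hand i at t - 1 against hand j at t (higher is
    better). Returns a list of (i, j) pairs; each index is used at most once.
    """
    sim = np.asarray(similarity, dtype=float)
    if sim.ndim != 2 or sim.size == 0:
        return []
    # Visit all pairs from most to least similar.
    order = np.argsort(-sim, axis=None)
    rows, cols = np.unravel_index(order, sim.shape)
    used_prev, used_curr, matches = set(), set(), []
    for i, j in zip(rows.tolist(), cols.tolist()):
        if sim[i, j] < min_similarity:
            break  # every remaining pair is below the threshold
        if i in used_prev or j in used_curr:
            continue
        matches.append((i, j))
        used_prev.add(i)
        used_curr.add(j)
    return matches
```

For distance-based measures such as the average L2-distance, the negated distance can be passed as the similarity, with a correspondingly negative threshold. Current-frame hands left unmatched would typically start new tracks.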

TABLE II: Mean Average Precision (mAP). Performance is averaged across all folds.

TABLE III: We optimize for the Multiple Object Tracking Accuracy (MOTA); each performance metric is averaged across all validation folds.

TABLE IV: MOTA performance between matching strategies, averaged across all folds. Each row is optimized for highest MOTA performance. Matching strategies share the same base model, so it is possible for them to share the same mAP score.

Table VI. In our experiments, we are optimizing for the highest MOTA score and we found that to be δ = 3 with 39.31. The second highest is δ = 1 with a slightly smaller MOTA score (39.03),

TABLE VI: Effect of δ. Each model is trained with a separate δ value.