Robust hand tracking for surgical telestration

As human failure has been shown to be one primary cause for post-operative death, surgical training is of the utmost socioeconomic importance. In this context, the concept of surgical telestration has been introduced to enable experienced surgeons to efficiently and effectively mentor trainees in an intuitive way. While previous approaches to telestration have concentrated on overlaying drawings on surgical videos, we explore the augmented reality (AR) visualization of surgical hands to imitate the direct interaction with the situs.

Methods

We present a real-time hand tracking pipeline specifically designed for the application of surgical telestration. It comprises three modules, dedicated to (1) the coarse localization of the expert’s hand and the subsequent (2) segmentation of the hand for AR visualization in the field of view of the trainee and (3) regression of keypoints making up the hand’s skeleton. The semantic representation is obtained to offer the ability for structured reporting of the motions performed as part of the teaching.

Results

According to a comprehensive validation based on a large data set comprising more than 14,000 annotated images with varying application-relevant conditions, our algorithm enables real-time hand tracking and is sufficiently accurate for the task of surgical telestration. In a retrospective validation study, a mean detection accuracy of 98%, a mean keypoint regression accuracy of 10.0 px and a mean Dice Similarity Coefficient of 0.95 were achieved. In a prospective validation study, it showed uncompromised performance when the sensor, operator or gesture varied.

Conclusion

Due to its high accuracy and fast inference time, our neural network-based approach to hand tracking is well suited for an AR approach to surgical telestration. Future work should be directed to evaluating the clinical value of the approach.

Using hand pose estimation to automate open surgery training feedback

Article 30 May 2023

HMD-EgoPose: head-mounted display-based egocentric marker-less tool and hand pose estimation for augmented surgical guidance

Article 14 June 2022

POV-Surgery: A Dataset for Egocentric Hand and Tool Pose Estimation During Surgical Activities

Discover the latest articles, news and stories from top researchers in related subjects.

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

Death within 30 days after surgery has recently been found to be the third-leading contributor to death worldwide [1]. As a large portion of deaths can be attributed to human failure, surgical training is of the utmost socioeconomic importance. In this context, the concept of surgical telestration has been introduced to enable experienced surgeons to efficiently and intuitively mentor trainees during surgical training [2]. In the context of laparoscopic surgery, the key idea is to offer the mentor the ability to highlight/point at anatomical structures within the surgical video to guide the trainee during the surgery. Intuitively one would expect senior surgeons to physically point at anatomy directly on the display, but this comes with several restrictions regarding hardware setup, surgical workflow as well as sterilization requirements, because the mentor would need to be able to touch the monitor that the trainee is looking at. As a result, this approach is almost never used in practice. More recent computer-assisted alternatives for surgical telestration [3,4,5] typically require the mentor to operate on a separate computer system to draw simple overlays (e.g., lines, circles) onto the video seen by the trainee (Fig. 1). While this approach can even be used for remote training, it is less intuitive, slower and challenging to implement in the presence of high organ deformation.

To exploit the benefits of both approaches, we investigate a concept in which the hands of the experienced surgeon are continuously monitored and transferred as an augmented reality (AR) overlay onto the laparoscopic video (patent pending [6]) (Figs. 1 and 2). The mentor is then able to be in direct interaction with the surgical trainee and provide intuitive guidance via hand gestures seen by the trainee on the surgical video. The concept is also applicable in remote settings because the mentor does not need to be present in the OR.

Key to the performance of this new concept of surgical telestration is an automatic accurate, robust and fast hand tracking. In this paper, we present the first hand tracking pipeline specifically designed for the application of surgical telestration (Fig. 2). In contrast to related approaches in the surgical domain, that focus on coarse hand and object tracking, localization and pose estimation [7,8,9], our method simultaneously outputs a fine-grained hand segmentation as an important prerequisite for the AR overlay. We further perform a comprehensive validation of the method based on a surgical data set of unprecedented size, comprising more than 14,000 images and reflecting application-specific challenges, such as changes in skin and surgical glove color.

Material and methods

Our approach to surgical telestration is depicted in Fig. 2. A camera captures the hands of the mentor who observes the (training) operation. The camera data are processed by a two-stage neural network, which outputs (1) a skeleton representation comprising positions of 21 keypoints as a semantic representation of the hand and (2) a segmentation of the hand for AR visualization in the field of view of the trainee. The skeleton representation is used as an auxiliary task for the hand segmentation as well as for long-term analysis of the application. In this paper, we focus on the hand tracking module, which is key to the performance of our approach. The underlying data set is described in “Data set” section. Our approach is based on the coarse localization of the hand via a bounding box (“Real-time hand localization” section) and the subsequent extraction of the skeleton (“Real-time skeleton tracking” section) and the hand segmentation (“Real-time hand segmentation” section).

Data set

Data acquisition

The data for development and initial validation of the hand tracking methodology were acquired at the Heidelberg University Hospital using a Real Sense D435i camera (Intel; Santa Clara, USA) in a surgical training setting. We acquired a total of 14,102 images on 66 days between March 2020 and April 2021, featuring a variety of different hand poses, lighting conditions as well as hand and glove colors. Our data set comprises approx. 46% light skin, 22% blue glove, 11% white glove, 11% green glove, 8% brown glove and 2% dark skin. As the telestration concept should also be applicable in settings with unpredictable background, we varied the latter, specifically with respect to the objects present. While we allowed for multiple hands to be present in the field of view of the camera, our concept assumes one primary hand used for telestration by the mentor.

Data annotation

In the acquired images, a medical expert annotated the primary hand of the expert by setting 21 keypoints representing the skeleton of the hand, as suggested in [10] and shown in Figs. 2 and 3. Additional metadata annotations include handedness (left, right), the skin (light, dark) or glove color (brown, blue, green, white) in the presence of the latter. A subset of 499 images^{Footnote 1} was then extracted, and a medical expert complemented the skeleton annotations with segmentations of the entire hand, as illustrated in Figs. 2 and 3.

Data set split

We split our data into a proportion of 80:20 for training (including hyperparameter tuning on a validation set) and assessment of performance. We note that videos taken on the same day are not necessarily independent due to comparable hardware setups. Therefore, we ensured that no data from the same day are present in both training and test set. To prevent data leakage, the segmentation train/test set is a subset of the corresponding keypoint train/test set. To achieve an adequate train/test split, we sampled days for the test set randomly, until a proportion of 20% was reached. We repeated this procedure once for the part of the data set with segmentation annotation and once for the part without. As we expect the measurement days to be a major contributor to the variance in the data set, we want to make sure that a sufficient number of measurement days in the test set. To guarantee this, we excluded the three measurement days with the most measurements from the test set. To split the training data into a training and validation set, a similar procedure was followed, but with only 10% of training data used for the validation set. For skeleton tracking, this resulted in a data set size of 11,541 as training set (including validation) and 2561 as the test set. For the segmentation task, a total of 395 images served as training set (including validation), the remaining 104 images served as test set. The validation data set was used to optimize the processing pipeline (see A.1); the test data set was used for performance assessment.

Real-time hand localization

Inspired by the MediaPipe model [10], we use a bounding box detection step prior to both skeleton tracking as well as segmentation. Our reference bounding boxes were derived from the skeleton points by first constructing a tight bounding box that encloses all skeleton points and then enlarging this box by a factor of two and squaring it. Using the cropped image as input for the skeleton extraction and segmentation ensures a relatively constant size of the hand and enables us to phrase the skeleton extraction task as a regression task based on the 21 keypoints. While MediaPipe uses a single shot palm detector [10], we apply YOLOv5s [11] as our detection model to predict the bounding boxes as we identified YOLOv5s being a good compromise between accuracy and speed [12]. The predicted boxes are post-processed using the Non-Maximum-Suppression (NMS) algorithm with an Intersection over Union (IoU) threshold of 0.5. Unlike Mediapipe, where bounding boxes are only inferred if a detection is considered lost, we employ the bounding box model continuously. This is possible due to the short inference times of the YOLOv5s model in conjunction with our hardware setup. The box with the highest confidence score is used in the downstream tasks.

We use the training procedure as presented in the official implementation [11] with a stochastic gradient descent (SGD) optimizer, learning rate of 0.01 and binary cross-entropy as the loss function. The augmentations used for training stem from the official implementation, namely: hue, saturation, value, and mosaic augmentations, and horizontal flips. We save the weights based on the epoch with the best mean average precision (mAP).

Real-time skeleton tracking

Our skeleton tracking architecture operates on the cropped images generated by the bounding box model. We use an EfficientNet B3 [13, 14] loaded with Noisy Student pretrained weights [15] as the backbone for our regression model with an L1 loss for optimization. The model is trained using the Adam optimizer with an automatically set learning rate [16]. We save the model weights based on the mean regression accuracy on the validation set, which is the mean Euclidean distance between annotated reference and model prediction of the skeleton joints. During training, we use random offsets and alter the size of the reference bounding box to account for the fact that the regression and segmentation models will not be provided with perfect bounding boxes at inference time. In addition, we use the following augmentations implemented in the Albumentations library [17] namely: brightness and contrast, RGB channel shuffle, RGB shift, hue saturation, Gaussian noise, blur, rotation. The aforementioned augmentations are activated with per sample probability \(p=0.15\).

Real-time hand segmentation

Baseline model

As for the hand skeleton, we utilize cropped images based on bounding boxes for the segmentation. Our segmentation model uses a Feature Pyramid Networks FPN [18] with an EfficientNet B3 encoder loaded with Noisy Student pretrained weights [15] as a backbone and is trained by optimizing the binary cross-entropy loss using an adam optimizer with a learning rate of \(10^{-4}\). We utilize the same augmentations in the Albumentations [17] as for the skeleton training.

Model with auxiliary task

In a variant of this approach, we use the skeleton tracking task as an auxiliary task for our model. To this end, the 21 keypoints regressed by our skeleton tracking model are used as additional input (one channel per keypoint) of the hand tracking module. Each channel contains a two-dimensional isotropic Gaussian centered at the keypoint location and a standard deviation of 5 px. To account for the different input channels, while being able to utilize pretraining, we add three CNN layers that merge the RGB with the Gaussian input prior to the actual backbone.

The primary purpose of our study was to assess the accuracy, robustness and speed of our hand tracking pipeline in the surgical training setting. Specifically, we investigated the regression accuracy for the keypoints making up the skeleton of the hand (“Real-time skeleton tracking” section), as well as the segmentation accuracy (“Real-time hand segmentation” section) set whose distribution was similar to that of the training set. In a second, prospective study, we assessed the generalization capabilities of our method, by including mentors, gestures and cameras that were not part of the training data (“Assessment of generalization capabilities” section).

Assessment of speed and accuracy

As an initial retrospective validation of our method, we determined the speed and accuracy for the skeleton tracking and hand segmentation using the test set of the data set described in “Data set” section. The workstation for the assessment of the inference time was equipped with a Nvidia Geforce RTX 3090 and an AMD Ryzen 9 3900X 12-Core Processor.

Real-time skeleton tracking

Experiments

To assess the skeleton tracking accuracy, we determined descriptive statistics over the mean keypoint distance on the test set, using MediaPipe [10] as our baseline, because it is widely used and was specifically developed for integration of hand tracking in third-party applications. The hierarchical structure of the data was respected by aggregating over individual frames of one acquisition day before performing the aggregation over all acquisition days. As the MediaPipe method struggled in the presence of dark skin and gloves, we (1) additionally assessed the performance grouped by skin/glove color and (2) divided the performance assessment into the steps of successful hand detection and the keypoint regression performance. We approximated a successful hand detection by comparing the center of gravity of the regressed skeleton joints with the corresponding center of gravity of the reference skeleton joints. A distance below 100 px was regarded as a match. To enable a fair comparison, we compensated for the fact that MediaPipe has no notion of a primary hand. To this end, we chose the bounding box out of the four bounding boxes with the highest confidence, that was closest to the reference bounding box. The analysis of the regression performance was then only computed for successful localizations. Note that there is a tradeoff between regression accuracy and detection accuracy depending on the threshold.