DisguisOR: holistic face anonymization for the operating room

Purpose Recent advances in Surgical Data Science (SDS) have contributed to an increase in video recordings from hospital environments. While methods such as surgical workflow recognition show potential in increasing the quality of patient care, the quantity of video data has surpassed the scale at which images can be manually anonymized. Existing automated 2D anonymization methods underperform in Operating Rooms (OR) due to occlusions and obstructions. We propose to anonymize multi-view OR recordings using 3D data from multiple camera streams.

Methods RGB and depth images from multiple cameras are fused into a 3D point cloud representation of the scene. We then detect each individual's face in 3D by regressing a parametric human mesh model onto detected 3D human keypoints and aligning the face mesh with the fused 3D point cloud. The mesh model is rendered into every acquired camera view, replacing each individual's face.

Results Our method shows promise in locating faces at a higher rate than existing approaches. DisguisOR produces geometrically consistent anonymizations for each camera view, enabling more realistic anonymization that is less detrimental to downstream tasks.

Conclusion Frequent obstructions and crowding in operating rooms leave significant room for improvement for off-the-shelf anonymization methods. DisguisOR addresses privacy on a scene level and has the potential to facilitate further research in SDS.

Supplementary Information The online version contains supplementary material available at 10.1007/s11548-023-02939-6.


Introduction
The past years have seen an increase in video acquisitions in hospitals and surgical environments. In the field of surgical data science (SDS), the analysis of endoscopic and laparoscopic frames is already an established research direction [1]. It aims to build cognitive systems capable of understanding the procedural steps of an intervention, for example, recognizing and localizing surgical tools [2]. Closely related to the endoscopic frames are videos from externally mounted cameras, capturing the surgical scene from an outside perspective [3]. These rich information sources build the foundation for analyzing and optimizing the workflow, essential for developing context-aware intelligent systems, improving patient quality of care, and advancing anomaly detection. However, video recordings of surgeries are still considered problematic due to strict privacy regulations established to protect both patients and medical staff. As manually anonymizing video frames is no longer feasible at scale, it is imperative to develop automatic de-identification methods to advance future research and facilitate SDS dataset curation.
Surgical operating rooms are frequently crowded and packed with medical equipment. Cameras can only be mounted at particular positions, leading to perspectives not usually found in conventional datasets [4]. This poses challenges even for advanced anonymization methods, as they tend to perform poorly under partial occlusions and obscure camera angles [5]. A few methods address the specific challenges of OR anonymization [5, 6] from individual cameras. Recent works propose addressing the OR's unique challenges by combining multi-view RGB-D data to compensate for missed information in surgical workflow recognition [3, 7-9]. The existence of such multi-view OR recordings requires anonymizing all camera views, as a failed anonymization in a single view breaches the privacy of the entire scene.
We propose a novel anonymization approach for multi-view recordings, which leverages 3D information to detect faces where conventional methods fail. We utilize a 3D mesh to accurately replace each detected person's face, preserving privacy as well as data integrity in all camera views. In Figure 1, we compare single- and multi-view approaches, highlighting the advantages of scene-level anonymization. We additionally show that, in comparison to existing methods, our face replacement yields images that harmonize well with the surgical environment as measured by image similarity. Our main contributions can be summarized as follows:
• We present a novel framework for accurate multi-view 2D face localization by leveraging 3D information. We further emphasize the necessity for consistent anonymization across all camera views using our proposed holistic recall.
• We present a training-free, mesh-based anonymization method yielding complete control during the 3D face replacement step while generating more realistic results than existing state-of-the-art approaches.
• The images anonymized by our framework can be effectively utilized by downstream methods, as shown through experiments on image quality assessment and downstream face localization.

Related Works
Face detection. With the advent of public face detection benchmark datasets like WIDERFACE [4], numerous deep learning-based face detectors were introduced in recent years [11-13]. Such methods typically regress a bounding box onto the region where a face can be successfully identified in the image.
As WIDERFACE consists of annotated images from everyday scenarios, face detectors trained on this dataset can suffer in complex and crowded OR environments [5]. Occlusions and obstructions from medical equipment or personnel in close quarters, masks, and skull caps can lead to missed predictions and, ultimately, incomplete anonymizations. While we also use 3D data for anonymization, our work diverges from 3D face recognition [14], where a scan of a 3D face is matched to a catalogue of face scans.

Image anonymization. Identity scrubbing can be achieved by removing the sensitive area, blurring, or pixelization [3]. In the OR, standardized scrubs and gloves already obscure many possible landmarks, leaving the face as the primary identifier that could be used for re-identification, as previously established [5, 6]. A recent line of work has proposed to replace faces with artificially generated faces using GANs [10, 15] or parametric face models [16]. Such replacement methods tend to yield more realistic-looking output, and the resulting anonymized area resembles the input more closely, which can positively affect downstream applications [15]. However, these methods typically contain a separate branch to handle face detection [10] and thus suffer similarly in OR environments due to partial obstructions.

Human Pose Estimation. Using human pose estimation as additional context to localize faces has been demonstrated to be valuable [5]. The torso, shoulders, and arms provide useful cues for localizing faces occluded under a surgical mask and skull cap. Beyond mere 2D human keypoint detection, a significant emphasis has also been placed on regressing keypoints from multiple camera views in a shared 3D space [17, 18]. 3D human pose detection can be especially beneficial for multi-person scenarios such as surgical ORs, where ubiquitous occlusions can lead to poor performance in individual camera views [19, 20]. Regressing a 3D human shape from a single input image is also an active area of research [21]. However, such methods suffer similarly from partial occlusions. To avoid this shortcoming, we leverage the 3D nature of multi-view OR acquisitions.

Methods
An overview of our proposed method DisguisOR is shown in Figure 2. We use the multi-view OR dataset introduced in Bastian et al. [7], depicting veterinary laparoscopic procedures, expanded to all four cameras available in the acquisitions. Each camera's color and depth images are combined into a colored 3D point cloud using the Azure Kinect framework. Subsequently, the four partial point clouds are registered into one global coordinate space by minimizing the photometric reprojection error over keypoints on a large visual marker. Our pipeline thus uses RGB images and depth maps from each camera, along with a fused point cloud of the entire scene, as input.

Multi-person 3D Mesh Regression. We adopt an unsupervised three-stage approach to fit a 3D mesh [22] for each person in the scene. 2D human keypoints are first detected [23] in each camera view and regressed in a global coordinate frame with VoxelPose [18]. As neither 2D nor 3D human poses are available as ground truth, we use an existing detector [23] trained on COCO to estimate 2D human keypoints from an image. To combine poses from each view in a robust manner, VoxelPose must first be trained to learn how multiple 2D poses from each camera can be optimally combined in 3D. To achieve this, we follow the procedure described in [18] and synthetically generate ground truth by sampling existing 3D human poses from the Panoptic dataset [24] and placing them at random locations in the 3D space. These poses are then projected back into each 2D image plane and used as input to guide VoxelPose through the 2D-to-3D multi-person regression task. The trained model ultimately combines 2D human poses from multiple views into one joint 3D human pose for each person in the scene. We then perform an additional temporal smoothing on each 3D human-pose sequence to interpolate missing poses and reduce noise (for details, see suppl.).

3D Human Representation. In order to adequately represent the face of each individual in the scene, we propose to use the statistical parametric human mesh model SMPL [22], which we regress onto each 3D human pose obtained as output from VoxelPose. While temporal smoothing yielded less noisy keypoint estimates, we noticed that the 3D mesh model did not always align with the 3D point cloud of an individual, resulting in inaccurate face localizations. To resolve this issue, we perform a rigid registration between the head segment of the SMPL model and the point cloud. More specifically, we crop the point cloud around the estimated head of the SMPL model, align the model with the point cloud using the probabilistic point-set registration method FilterReg [25], and subsequently fine-tune using iterative closest point (ICP) [26]. As a final step, we extract the face from the SMPL mesh, which should now be aligned with the 3D location of an individual's face.
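The fusion of color and depth into a point cloud rests on the standard pinhole camera model, which the Azure Kinect framework applies internally. As a minimal sketch (the intrinsics and extrinsics below are hypothetical placeholders, not our calibration), a depth pixel is lifted into camera space and then mapped into the shared world frame:

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift a depth pixel (u, v) with metric depth into a 3D camera-space point
    using pinhole intrinsics (focal lengths fx, fy; principal point cx, cy)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def to_world(point, rotation, translation):
    """Apply a rigid extrinsic transform (3x3 rotation as nested tuples plus a
    translation) to map a camera-space point into the global coordinate frame."""
    px, py, pz = point
    return tuple(
        rotation[i][0] * px + rotation[i][1] * py + rotation[i][2] * pz + translation[i]
        for i in range(3)
    )
```

Repeating this for every valid depth pixel of every camera, with each camera's calibrated extrinsics, yields the four partial point clouds that are fused into the scene-level cloud.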
Rendering the faces in 2D. Thus far, our pipeline estimates a global 3D face mesh that overlaps with each person in the scene. To yield a 2D anonymization, these meshes can now be projected back into all camera views, replacing the face of each individual with a unique template (see Figure 5). However, a 3D face might be occluded in a particular view, for example by an OR light, and therefore not visible. To mitigate false-positive predictions, we check whether a 3D face is visible in 2D by looking for a disparity between the camera's depth map and the 3D face mesh (for details, see suppl.). We then utilize the Poisson Image Editing technique [27] to harmonize the face template and the background image for a more natural-appearing face replacement. The template can also be changed for each individual to influence factors such as age, sex, or ethnicity.
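The exact disparity test is described in the supplementary material; a simplified sketch of the idea follows. Each mesh vertex is projected into the view, and its depth is compared against the sensor's depth map at that pixel: an occluder in front of the face shows up as a depth reading noticeably closer than the vertex itself. The tolerance and visibility-ratio parameters here are illustrative, not the paper's:

```python
def face_visible(face_points_cam, depth_map, fx, fy, cx, cy, tol=0.1, min_ratio=0.5):
    """Check whether a 3D face mesh is unoccluded in a camera view.

    face_points_cam: list of (x, y, z) mesh vertices in camera coordinates.
    depth_map: 2D list of metric depth values (row-major).
    A vertex counts as visible if the depth map agrees within `tol` metres;
    the face is kept if at least `min_ratio` of its projected vertices are visible.
    """
    visible, total = 0, 0
    for x, y, z in face_points_cam:
        if z <= 0:
            continue  # behind the camera
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= v < len(depth_map) and 0 <= u < len(depth_map[0]):
            total += 1
            d = depth_map[v][u]
            # A valid depth reading much closer than the vertex implies occlusion.
            if d > 0 and abs(d - z) <= tol:
                visible += 1
    return total > 0 and visible / total >= min_ratio
```

Faces failing this check are simply not rendered into that view, which suppresses false-positive replacements on occluders such as OR lights.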

Ground Truth Curation
We identified three distinct scenarios of varying difficulty from the complete dataset [7]. We then annotated each visible face manually in all camera views, for a total of 4913 face bounding boxes. The annotation criteria were adopted to closely match the style of the WIDERFACE dataset [4]. The three scenarios are chosen to specifically represent the varying characteristics of the OR. They differ in the number of individuals present, their attire, and the degree of obstructions (Figure 3).

Face Localization. We compare the proposed method's face localization performance with that of DSFD [11], a state-of-the-art detector also used in DeepPrivacy [10]. We use the model pre-trained on WIDERFACE [4] provided by the authors. We additionally evaluate the self-supervised domain adaptation (SSDA) strategy proposed by Issenhuth et al. [5]. Here we also use DSFD as the face detection backbone, fine-tuning it on 20k unlabeled images as proposed, with the suggested hyperparameters.
In addition to recall, we propose to evaluate multi-view OR anonymization with a metric we coin holistic recall. The holistic recall considers a face as detected only if it was identified in all camera views where it is at least partially visible. We argue that this is more suitable than image-wise evaluation, as a missed detection of a face in a single view results in a breach of anonymization for that individual.
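Given per-view annotations and matched detections, the metric defined above can be sketched in a few lines (the data-structure layout is ours, chosen for illustration):

```python
def holistic_recall(visibility, detections):
    """Compute holistic recall over a multi-view scene.

    visibility: dict face_id -> set of camera ids where the face is at least
                partially visible (from the ground-truth annotation).
    detections: dict face_id -> set of camera ids where a detector matched
                the face (e.g. at IoU >= 0.4).
    A face counts as detected only if it was found in *every* view where it
    is visible; a single missed view breaches that person's anonymity.
    """
    faces = [f for f, views in visibility.items() if views]
    if not faces:
        return 0.0
    hits = sum(1 for f in faces if visibility[f] <= detections.get(f, set()))
    return hits / len(faces)
```

Standard recall, by contrast, would credit each per-view detection independently, which is why it can overstate the privacy protection a method provides.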
We calculate the smallest rectangle outlining the rendered mesh to generate face predictions for evaluation. As the proposed method does not rank its output detections with a confidence score, the commonly used average precision (AP) score is not defined. Therefore, we additionally report precision and F1-score for all three methods in the supplementary materials. Furthermore, the four cameras are categorized as either a surgical camera (SC) or a workflow camera (WFC), depending on the perspective of the camera (see Figure 3). The images and angle of acquisition in WFCs are more similar to what might be found in public face detection datasets [4], while SCs may acquire the scene from above, and individuals are more frequently obscured by OR equipment.

Image Quality. We compare the images anonymized by our approach to those altered by several conventional anonymization methods, such as blurring (61x61 kernel), pixelization (8x8 pixels), and blackening, as well as the established GAN-based model DeepPrivacy [10] (see 2D anonymization in Figure 1). To disentangle image quality and face detection, we only evaluate image quality on faces detected by both our method and DeepPrivacy, totaling 3786 faces. We evaluate the effectiveness of our face replacements on the cropped ground-truth bounding boxes with three established image quality metrics. The Fréchet inception distance (FID) [28] measures overall realism by calculating the distribution distance between the original and generated sets of images. Learned perceptual image patch similarity (LPIPS) [29] reflects the human perception of an image's realism by computing the difference between the activations of a standard neural network on two image patches. The structural similarity index measure (SSIM) [30] calculates the quality of an image pixel-wise based on luminance, contrast, and structure.
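Matching the mesh-derived rectangles to annotations uses the standard intersection-over-union criterion at a 0.4 threshold. A minimal sketch of that matching step (boxes as (x1, y1, x2, y2) corner tuples, a layout we assume for illustration):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(gt_boxes, pred_boxes, thresh=0.4):
    """Fraction of ground-truth boxes matched by some prediction at IoU >= thresh."""
    if not gt_boxes:
        return 0.0
    matched = sum(1 for gt in gt_boxes if any(iou(gt, p) >= thresh for p in pred_boxes))
    return matched / len(gt_boxes)
```

Because our predictions carry no confidence scores, every rendered rectangle is treated as a detection; this is exactly why precision and F1 replace AP in our tables.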
Finally, we conduct additional experiments on the downstream behavior of off-the-shelf methods on our anonymized faces (see suppl.).

Face Detection
Figure 4 depicts the performance of our proposed method in comparison with two existing baselines. In the easy evaluation scenario, both DSFD and DisguisOR perform comparably, while SSDA achieves a 9% higher holistic recall. In the medium and hard scenarios, DisguisOR outperforms DSFD and SSDA in holistic recall by 10% and 3%, and 11% and 16%, respectively. These disparities are largely due to a poor detection rate in the surgical cameras, which are acquired from unusual camera angles and contain frequent obstructions (Figure 3). DisguisOR is able to better cope with the increased occlusions and the number of individuals present in these scenarios, highlighting the proposed method's robustness under partial visibility. By combining information from multiple cameras, DisguisOR yields a geometrically consistent detection: if an individual face has been accurately localized in the 3D scene, it can be more consistently identified in each individual image. While DSFD achieves a significantly higher accuracy in the easy scenario when refined via SSDA, the human pose backbone of DisguisOR underwent no such refinement and would likely also see some performance improvements. SSDA underperforms the baseline DSFD model in the hard scenario, as well as DisguisOR in both the medium and hard scenarios. This could be because these more challenging detection candidates are less represented in the training data distribution or are not detected with a high confidence score, and thus not pseudo-labeled frequently enough.

Fig. 4 Face Localization Performance of DSFD [11], the Self-Supervised Domain Adaptation (SSDA) Method of [5], and DisguisOR Over All Scenarios. The holistic recall considers a face as detected only if the same face was successfully detected in all camera views where it is at least partially visible. Both recall and holistic recall are reported at IOU@0.4.
The recall rates over individual cameras reflect the characterizations of surgical and workflow camera views. While DSFD generally achieves slightly higher recall rates in workflow camera views (WFC1, WFC2), DisguisOR achieves much higher recall rates in the surgical camera views (SC1, SC2); see Figure 3. SSDA improves recall for DSFD in surgical cameras, although it still falls short of DisguisOR in the medium and hard scenarios. The surgical cameras in the hard scenario are especially challenging for face detectors, as severe occlusions, unusual camera angles, and surgical scrubs drastically impair the detection rate. For faces in SC1 of the hard scenario (see person 1 in Figure 1), DSFD achieves a recall rate of 16.9%. Using SSDA increases this recall rate to 52.8%, which DisguisOR still outperforms with a recall rate of 97.8%.
Our method is somewhat limited by the field-of-view (FOV) of the depth sensors during an acquisition.This partially explains the comparable performance with DSFD in the easy scenario, as individuals frequently move along the edge of the scene where depth coverage is limited.The 3D reconstruction we use to triangulate faces could also be performed without the use of the slightly more costly depth sensors, albeit less accurately.

Image Quality
In Table 1, we measure the quality of images altered by baseline approaches and our proposed method. As expected, conventional obfuscations like blackening, pixelization, and blurring achieve inferior results across all three metrics. DeepPrivacy [10] is designed to generate synthetic faces instead of applying conventional privacy filters, explaining its improved results on all image quality metrics compared to the conventional methods. Our method further improves upon these results, as the replacement of the face information can be precisely controlled, even enabling the replacement of people wearing masks without creating corrupted or unnatural faces. In Figure 5, we illustrate examples where DeepPrivacy replaces the face mask of a person with an unnatural mouth (e), while our method manages to blend the template and original image (f) more effectively.
!" #" $" %" &" '" Fig. 5 Faces From Our Dataset Anonymized With Different Methods.Two unaltered faces (a, d), faces anonymized with DeepPrivacy [10] (b, e) and anonymized with DisguisOR (c, f).Note that DeepPrivacy fails to incorporate face masks in its generated faces.Rendering faces with a texture that reflects the setting (in this case a texture with a mask) can yield more consistent face replacement results

Conclusion
Existing anonymization methods do not effectively leverage multi-view data, as they consider individual views independently. OR cameras are frequently mounted in unconventional positions and therefore suffer from heavy occlusions, making multiple views essential for accurately acquiring details of a procedure. Our 3D face detection framework DisguisOR enables consistency over each camera, preventing missed detections in a single view that would breach an individual's anonymity. Therefore, we emphasize scene-level anonymization with our proposed holistic recall metric, which considers the recall of faces detected jointly in all camera views. We validate our face detection approach based on recall on individual camera views as well as holistic recall, demonstrating that our method achieves state-of-the-art results under challenging scenarios and views. Furthermore, anonymization methods must balance the discrepancy between anonymizing data and retaining its downstream utility. We show that our framework reduces this discrepancy by yielding more realistic face replacements compared to existing methods. The modularity of our anonymization approach provides fine-grained control of the face replacement, allowing us to vary parameters such as age, gender, or ethnicity. Existing datasets could even be augmented with faces representing a broad demographic, combating bias induced by unrepresentative training sets. We are convinced that our method will facilitate further research by reducing the burden of manually annotating existing and future multi-view data acquisitions.
Fig. 1 Obstructed faces in the point cloud and color image. The left image shows three face meshes in 3D; the right image shows the depth map of surgical camera 2. The two red rectangles reveal the positions of two face meshes in both images. Our method checks for obstructions by calculating the distance between the meshes' 3D locations and the inferred 3D locations from the depth map.

Fig. 2 Examples of faces from our dataset with their respective annotation.

Rendering: Poisson Image Editing
We further highlight the effect of different blending parameters of the Poisson image editing harmonization (Figure 3). For additional robustness towards privacy preservation, one may also opt to blur the source image prior to blending (Figure 4) or avoid blending altogether (Figure 3, Alpha: 1.0).
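Poisson image editing solves a gradient-domain system and is best left to an existing implementation; the toy sketch below instead uses plain alpha compositing to illustrate how the blending strength and the optional pre-blur interact (function and parameter names are ours, purely for illustration, and images are grayscale lists of lists):

```python
def box_blur(img, radius=1):
    """Naive box blur on a 2D grayscale image given as a list of lists."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            out[y][x] = sum(window) / len(window)
    return out

def blend_face(template, background, alpha=0.7, pre_blur=False):
    """Composite a face template over the background face region.

    alpha=1.0 pastes the template directly (no blending); pre-blurring the
    background additionally hides residual identifying detail underneath.
    """
    if pre_blur:
        background = box_blur(background)
    return [
        [alpha * t + (1.0 - alpha) * b for t, b in zip(t_row, b_row)]
        for t_row, b_row in zip(template, background)
    ]
```

In the actual pipeline, the gradient-domain blend makes the template adopt the background's lighting rather than linearly mixing intensities, but the alpha and blur knobs shown in Figures 3 and 4 play the same roles.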

Face Localization
In Tables 1, 2, and 3, we present the precision (P), recall (R), F1-score (F1), and number of annotated faces for each individual camera and scenario. The precision of DisguisOR is lower than that of DSFD in some scenarios, notably in the surgical cameras (SC1 and SC2). This is mainly due to the hardware constraints of the Kinect camera system, which does not generate a corresponding depth value for each pixel. Cropping to the depth field of view and downsizing the color image to the resolution of the depth camera would mitigate these issues and likely favor DisguisOR; however, we perform inference for all methods at the native resolution of 2048x1536. While the self-supervised domain adaptation (SSDA) [1] with DSFD increased the recall significantly, especially in the surgical cameras, it also decreased the precision, resulting in a lower F1-score.
Table 1 Comparison of precision (P), recall (R), and F1-score (F1) at IOU@0.4 of DSFD [2], the Self-Supervised Domain Adaptation (SSDA) method of [1], and DisguisOR on the Easy Scenario across all cameras. The last column depicts the number of ground truth faces in each camera view.

Different Confidence Thresholds
Table 4 depicts the precision, recall, and F1-scores of DSFD [2] using different confidence thresholds. The default confidence threshold is 0.5. We average each metric across all four cameras and present the results for each scenario.
Table 2 Comparison of precision (P), recall (R), and F1-score (F1) at IOU@0.4 of DSFD [2], the Self-Supervised Domain Adaptation (SSDA) method of [1], and DisguisOR on the Medium Scenario across all cameras. The last column depicts the number of ground truth faces in each camera view.

Lowering the confidence threshold increases the recall, but at a significant cost in precision. Even at lower confidence thresholds, DSFD does not attain the recall of DisguisOR. Furthermore, errant detections cover large swaths of the images (Figure 5). Subsequent anonymizations could impact the image integrity and affect downstream tasks.
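The precision-recall trade-off behind Table 4 can be sketched as a simple threshold sweep over scored detections (the pair-based data layout is an assumption for illustration, not DSFD's output format):

```python
def pr_at_threshold(detections, num_gt, thresh):
    """Precision and recall when only detections scoring >= thresh are kept.

    detections: list of (confidence, is_true_positive) pairs, one per raw
                detection from the face detector.
    num_gt:     number of annotated ground-truth faces.
    """
    kept = [(c, tp) for c, tp in detections if c >= thresh]
    true_positives = sum(1 for _, tp in kept if tp)
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / num_gt if num_gt else 0.0
    return precision, recall
```

Sweeping `thresh` downward admits more true positives but also more false positives, reproducing the pattern reported above: recall rises while precision falls.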

Experiments on Downstream Tasks
Face Detection. To evaluate the preservation of image integrity, we compare our anonymization method with the three conventional anonymization methods (blurring, pixelization, and blackening) and DeepPrivacy [3] on downstream face detection. This experiment is designed to measure the degree to which an anonymization method creates unnatural alterations, which would lead to errant predictions from an existing method. We compare the performance of the pre-trained DSFD [2] model on images from all three scenarios anonymized through several methods. For a fair comparison between the two methods, we only consider face detections made by both DeepPrivacy and DisguisOR. We report the percentage of average precision (AP) at an intersection over union of greater than 0.4 (IOU@0.4) retained with respect to the original unaltered images.
Human Pose Estimation. In order to further demonstrate the image quality preservation of our method, we evaluate the widely used AlphaPose [4] human pose estimator, pre-trained on COCO [5], to understand the effect of unnatural image artifacts on human pose estimation. We generate pseudo-ground truth by projecting the 3D joint positions of each generated SMPL mesh into each camera view. In accordance with previously established works [1], we use the percentage of correct keypoints (PCK) metric to measure how many keypoints were accurately predicted within a threshold. DeepPrivacy is omitted from this experiment to avoid an unfair bias in favor of DisguisOR due to how the pseudo-ground truth is generated.
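The PCK computation itself is straightforward; a minimal sketch with an absolute pixel threshold follows (the PCKh variant used in Table 6 normalizes this threshold by the head segment length instead):

```python
import math

def pck(pred, gt, threshold):
    """Percentage of Correct Keypoints: a predicted joint counts as correct
    when it lies within `threshold` of its ground-truth joint.

    pred, gt: lists of (x, y) keypoints in corresponding order; a gt entry
    of None marks a joint without a pseudo-ground-truth annotation.
    """
    correct, total = 0, 0
    for p, g in zip(pred, gt):
        if g is None:
            continue  # joint not annotated, excluded from the metric
        total += 1
        if math.dist(p, g) <= threshold:
            correct += 1
    return correct / total if total else 0.0
```

Comparing this score on original versus anonymized frames quantifies how much a given anonymization disturbs the downstream pose estimator.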
Face Detection. Table 5 depicts the downstream face detection performance of the face detector DSFD [2] on images anonymized through various obfuscation methods. We report the percentage of AP retained compared to the AP achieved on the original unaltered images. The results show that blackening the face area removes a considerable amount of information, making it prohibitively difficult for a method to localize a face. Blurring or pixelization is less detrimental to the detection methods' performance. Nevertheless, the detection results of both methods are severely impacted by the blurring of face regions. This becomes even more evident in difficult scenarios. For all three conventional anonymization approaches, the performance drop is severe, and the AP for the face detection task is too low for most use cases. In contrast, our proposed method generates faces that the detection methods can identify with high accuracy.
In the easy scenario, we observe a decrease of merely 2% in DSFD's face detection AP, highlighting the preservation qualities of our method. DeepPrivacy is able to retain more AP in the medium and hard scenarios, most likely because it generates synthetic faces with distinct facial features. Using a maskless face texture further increases the retained AP by enabling face detectors to rely on more facial elements, such as the nose and mouth. However, this comes at a cost in image quality, as apparent from the experiments in the main paper. These results indicate that the realistic faces generated by DisguisOR could mitigate costly annotation and retraining due to an inferior anonymization method. Furthermore, they also indicate that a given template and source image harmonization may be more suitable for a certain application.
Human Pose Estimation. In Table 6, we report the PCK of the human pose estimator AlphaPose [4] on differently anonymized frames. Human pose estimators are less susceptible to limited modifications of the face area; therefore, the deviation of the PCKs is less severe. Blurring, pixelization, and blackening of the faces can confuse the methods, resulting in decreased performance. It is interesting to see that anonymization limited to the region of the head still has a measurable negative impact on the joint detection of the hip. This again highlights the importance of retaining the information and the need to anonymize without severe information loss. The PCK on images anonymized by our method is the highest, demonstrating the realism of our method.

Appendix
For reproducibility, we provide the GitHub repository link and Python version for each method that we have used in this paper. All computations were performed on a computer with 64 GB of RAM and an NVIDIA GeForce RTX 2080 Ti. DeepPrivacy needed approximately 2.44 s to anonymize each frame (i.e., 4 images), while DisguisOR needed approximately 6.47 s per frame (see Table 7). The majority of this time is spent on point cloud alignment and rendering, which could be made more efficient. More specifically, DisguisOR needed 0.52 s for 2D human pose estimation, 0.23 s for 3D human pose estimation, 0.62 s for human mesh estimation, 3.74 s for registration, and 1.36 s for rendering. The memory footprint of DeepPrivacy reached around 4.6 GB, while DisguisOR used approximately 6.5 GB.

Fig. 1
Fig. 1 Holistic Face Anonymization From Scene-Level Representation. Four views of a multi-view OR acquisition are visualized on the left, highlighting person 1 in each view. Conventional methods detect and anonymize faces in each image individually (green boxes) through either blackening (BLN), pixelization (PXL), blurring (BLR), or face replacement (FRP) [3, 6, 10]. Our framework (purple boxes) leverages multi-view information to localize faces in 3D, enabling consistent anonymization in all four images.

Fig. 2
Fig. 2 Anonymization Pipeline of DisguisOR. Four RGB images (a) and a fused 3D point cloud serve as input. Human pose keypoints are first detected in individual views and fused in 3D for each person in the scene (b). A parametric human mesh model is fitted onto the keypoints of each person in the scene (c), followed by further geometric refinement of the mesh positioning (d) and face extraction (e). Finally, the 3D mesh is texturized and back-projected into each camera view (f).

Fig. 3
Fig. 3 Overview of the Three Scenarios and All Four Camera Views. We categorize the scenes by their complexity. Surgical cameras (SC) are characterized by persistent obstructions and unusual viewing angles. Workflow cameras (WFC) exhibit ordinary viewing angles with fewer obstructions.

Fig. 4
Fig. 4 Illustration of combining Poisson image editing with blurring. Blurring the target image allows further hiding of prominent background information in the result. The medical mask's contour of the original target blends seamlessly into the final output.

Fig. 5
Fig. 5 Detections of DSFD [2] with different confidence thresholds. While lowering the threshold of DSFD leads to additional correct predictions, this typically comes at the expense of errant predictions covering large portions of the image.

Table 1
Comparison of different anonymization techniques based on the image quality metrics FID ↓ [28], LPIPS ↓ [29], and SSIM ↑ [30]. An arrow depicts whether a smaller (↓) or larger (↑) value is more favorable with respect to each metric.

Table 3
Comparison of precision (P), recall (R), and F1-score (F1) at IOU@0.4 of DSFD [2], the Self-Supervised Domain Adaptation (SSDA) method of [1], and DisguisOR on the Hard Scenario across all cameras. The last column depicts the number of ground truth faces in each camera view.

Table 5
Percentage of face detection AP of DSFD [2] at IOU@0.4 for differently anonymized images, with the AP on original images as the baseline. Masked Texture denotes that blended templates contain medical masks (see Figure 3), whereas Maskless Texture denotes that blended templates do not contain medical masks.

Table 6
PCKh@0.5 results for AlphaPose [4] on differently anonymized images. Images are taken from all scenarios across all cameras.