Facing privacy in neuroimaging: removing facial features degrades performance of image analysis methods

Background
Recent studies have created awareness that facial features can be reconstructed from high-resolution MRI. Therefore, data sharing in neuroimaging requires special attention to protect participants’ privacy. Facial features removal (FFR) could alleviate these concerns. We assessed the impact of three FFR methods on subsequent automated image analysis to obtain clinically relevant outcome measurements in three clinical groups.

Methods
FFR was performed using QuickShear, FaceMasking, and Defacing. In 110 subjects of Alzheimer’s Disease Neuroimaging Initiative, normalized brain volumes (NBV) were measured by SIENAX. In 70 multiple sclerosis patients of the MAGNIMS Study Group, lesion volumes (WMLV) were measured by lesion prediction algorithm in lesion segmentation toolbox. In 84 glioblastoma patients of the PICTURE Study Group, tumor volumes (GBV) were measured by BraTumIA. Failed analyses on FFR-processed images were recorded. Only cases in which all image analyses completed successfully were analyzed. Differences between outcomes obtained from FFR-processed and full images were assessed, by quantifying the intra-class correlation coefficient (ICC) for absolute agreement and by testing for systematic differences using paired t tests.

Results
Automated analysis methods failed in 0–19% of cases in FFR-processed images versus 0–2% of cases in full images. ICC for absolute agreement ranged from 0.312 (GBV after FaceMasking) to 0.998 (WMLV after Defacing). FaceMasking yielded higher NBV (p = 0.003) and WMLV (p ≤ 0.001). GBV was lower after QuickShear and Defacing (both p < 0.001).

Conclusions
All three outcome measures were affected differently by FFR, including failure of analysis methods and both “random” variation and systematic differences. Further study is warranted to ensure high-quality neuroimaging research while protecting participants’ privacy.

Key Points
• Protecting participants’ privacy when sharing MRI data is important.
• Impact of three facial features removal methods on subsequent analysis was assessed in three clinical groups.
• Removing facial features degrades performance of image analysis methods.

Electronic supplementary material
The online version of this article (10.1007/s00330-019-06459-3) contains supplementary material, which is available to authorized users.


Introduction
Sharing participant image data can offer many benefits to neuroradiological research: a better understanding of diseases can be achieved by access to larger participant populations in combined multicenter datasets; researchers without access to their own data on a specific disease can still contribute to its understanding by using shared datasets; and methodological improvements can be stimulated by publicly shared benchmark datasets. However, for shared data, it is crucial to protect participants' privacy. Image files should not contain identifying information such as name, date of birth, or any national or hospital-based registration numbers. Such data are often saved in metadata or even filenames of magnetic resonance (MR) images and should be removed before sharing. Unfortunately, this is not enough to alleviate privacy concerns, since typical structural magnetic resonance imaging (MRI) provides sufficient skin-to-air contrast and spatial resolution to perform facial recognition from a 3D-rendered version of the image, whether by the human eye or using facial recognition software [1][2][3][4][5]. Therefore, in addition to identifying metadata, it has been suggested that facial features should also be removed, and this has been widely embraced [6][7][8][9]. However, it is not yet clear whether the removal of the facial features affects subsequent measurement of quantitative indices of brain pathology.
Therefore, the current study assessed the impact of facial features removal (FFR) on clinically relevant outcome measurements. We selected three FFR methods that are publicly available, well documented, and have been used in data sharing initiatives [10,11]: QuickShear [12], FaceMasking [13], and Defacing [14]. We assessed their effects on clinically relevant outcome measures in three different diseases: normalized brain volumes (NBV) in Alzheimer's disease (AD), white matter lesion volumes (WMLV) in multiple sclerosis (MS), and tumor volumes (GBV) in glioblastoma patients.

Subjects
Subjects in this study were obtained from three different datasets: for AD, a dataset from the ADNI study (http://adni.loni.usc.edu/) [15]; for MS, a multicenter dataset from the MAGNIMS Study Group (https://www.magnims.eu/) [16]; and for treatment-naïve glioblastoma patients, a clinical dataset from the PICTURE project collected in the Amsterdam UMC, location VUmc, in Amsterdam, the Netherlands. Primary studies were approved by the respective local ethics committees for all three datasets. A summary of the demographics is given in Supplementary Table 1.

Alzheimer's disease
Data used in the preparation of this article were obtained from the ADNI database. The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial MRI, positron emission tomography, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD).
From the ADNI1 dataset, we selected the subset of subjects who had a 3-Tesla (T) magnetization-prepared rapid acquisition gradient echo (MPRAGE) baseline MRI, which is a subset of the 562 subjects that are in the ADNI1 dataset [15,17]. This subset included in total 110 (23% female) subjects with an average age of 75 (range 60-87) years. This dataset included 39 healthy elderly controls, 52 patients with mild cognitive impairment, and 19 patients with AD.

Multiple sclerosis
For MS, a multicenter dataset of the MAGNIMS Study Group was previously used to study iron accumulation in deep gray matter [18] and lesion segmentation software performance [16]. The dataset consisted of 70 patients (67% female), scanned in six different MAGNIMS centers. On average, the age was 34.9 (range 17-52) years. The mean disease duration from onset was 7.6 (range 1-28) years and the disease severity was measured using the Expanded Disability Status Scale (EDSS) on the day of scanning; patients had a median EDSS score of 2 (range 0.0-6.5) [19].

Glioblastoma
For glioblastoma, a total of 84 (38% female) patients were selected from a cohort treated at the Neurosurgical Center of the Amsterdam UMC, location VUmc, Amsterdam, the Netherlands, in 2012 and 2013. On average, the age was 61.4 (range 21-84) years. All patients had histopathologically confirmed WHO grade IV glioblastoma. The preoperative MRI was acquired, on average, within 1 week before resection.

MRI procedure
In the MS and AD datasets, all imaging was performed on 3-T whole-body MR systems; the glioblastoma dataset was imaged on 1.5- and 3-T MR systems. The protocol for the AD dataset included a 3D T1-weighted sequence, while the protocol for MS included a 3D T1-weighted sequence, as well as a 2D fluid-attenuated inversion recovery (FLAIR) sequence. The protocol for glioblastoma contained a 3D T1-weighted post-contrast scan, 3D FLAIR, and 2D T2-weighted and non-enhanced 2D T1-weighted sequences. More details on data acquisition of the datasets are listed in Table 1.

Facial features removal methods
Three publicly available methods were selected: QuickShear [12], FaceMasking [13], and Defacing [14] (Fig. 1). For all three methods, default settings were used in this study. FaceMasking was applied on all MR modalities separately. QuickShear and Defacing can only remove facial features from 3D T1 images. To remove the facial features from the other images, the full 3D T1 image of each subject was registered to the other full images of the same subject, using FSL-FLIRT (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FLIRT [20]), with 12 degrees of freedom before applying the FFR methods. Using the resulting transformation matrices, the 3D T1 image without facial features was transformed to each of the other image spaces, and subsequently binarized and used as a mask to remove the face from the other images.
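The masking step for the non-T1 images can be illustrated with a minimal sketch. It assumes that the defaced 3D T1 image has already been resampled into the space of the target image using the FLIRT transformation, and it represents each volume as a flat list of voxel intensities. The function names and the zero threshold used for binarization are assumptions made for illustration, not details taken from the study.

```python
def binarize(volume, threshold=0.0):
    """Return 1 for voxels with intensity above `threshold`, else 0.

    The threshold of zero is an assumption for this sketch.
    """
    return [1 if value > threshold else 0 for value in volume]


def apply_mask(volume, mask):
    """Set every voxel of `volume` to zero where `mask` is zero."""
    if len(volume) != len(mask):
        raise ValueError("volume and mask must have the same number of voxels")
    return [value * keep for value, keep in zip(volume, mask)]


# Toy example: four voxels; the first two were removed by FFR in the T1 image.
defaced_t1_in_flair_space = [0.0, 0.0, 120.0, 95.0]
flair = [40.0, 55.0, 210.0, 180.0]

mask = binarize(defaced_t1_in_flair_space)
flair_defaced = apply_mask(flair, mask)
```

Note that with a zero threshold, any voxel that is zero in the defaced T1 image (including air outside the head) is also set to zero in the target image.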

QuickShear
Starting from a user-supplied brain mask, QuickShear [12] uses two algorithms [21,22] to create a plane that divides the MRI into two parts. One part contains the facial features, and the other part contains the remainder of the head, including the brain. After finding this plane, the intensity of all voxels on the “facial features” side of the plane is set to zero.
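The final step of this procedure, zeroing all voxels on one side of a plane, can be sketched as follows. The sketch does not reproduce how QuickShear derives the plane from the brain mask; the plane is simply assumed to be given as a point and a normal vector pointing toward the face. The function name, the nested-list volume layout, and the toy values are hypothetical and kept dependency-free for illustration.

```python
def shear_face(volume, point, normal):
    """Zero all voxels on the positive side of a plane.

    `volume` is a nested list indexed as volume[x][y][z].
    The plane passes through `point`; `normal` is assumed to point
    toward the facial features.
    """
    px, py, pz = point
    nx, ny, nz = normal
    sheared = []
    for x, slab in enumerate(volume):
        new_slab = []
        for y, row in enumerate(slab):
            new_row = []
            for z, value in enumerate(row):
                # Signed distance (up to scale) of the voxel from the plane.
                signed = nx * (x - px) + ny * (y - py) + nz * (z - pz)
                new_row.append(0 if signed > 0 else value)
            new_slab.append(new_row)
        sheared.append(new_slab)
    return sheared


# Toy 2 x 3 x 1 volume; the plane lies at y = 1 with its normal along +y,
# so only voxels with y > 1 count as the "facial features" side.
volume = [[[5], [6], [7]],
          [[8], [9], [10]]]
result = shear_face(volume, point=(0, 1, 0), normal=(0, 1, 0))
```

Voxels lying exactly on the plane are kept in this sketch; that tie-breaking choice is also an assumption.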

FaceMasking
FaceMasking [13] deforms the surface of the face with a filter. In this study, the normalized filtering method was used, which is the recommended filter. The method first selects the boundary of the skull and the face, and registers the image volume to an atlas with annotated face coordinates; then, the identified face region of interest layer is normalized and filtered and, finally, transformed back to the original image.

Defacing
Defacing [14] uses an algorithm that calculates the probability of voxels being brain tissue or part of the face, based on 10 annotated atlases of healthy controls. Voxels that are labeled as part of the face and have zero probability of being brain tissue are considered to contain facial features, and their signal intensities are set to zero in order to remove the facial features.
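The voxel-wise decision rule described above can be sketched in a few lines. The sketch assumes that a face label and a brain-tissue probability are already available for every voxel; how these are derived from the annotated atlases is not shown. All names and values are hypothetical.

```python
def remove_face_voxels(intensities, face_label, brain_probability):
    """Zero voxels that are labeled as face AND have zero brain probability.

    All three inputs are flat lists with one entry per voxel.
    """
    if not (len(intensities) == len(face_label) == len(brain_probability)):
        raise ValueError("all inputs must have the same number of voxels")
    return [
        0 if (is_face and p_brain == 0.0) else value
        for value, is_face, p_brain in zip(intensities, face_label, brain_probability)
    ]


# Toy example with four voxels.
intensities = [100, 80, 60, 40]
face_label = [1, 1, 0, 0]
brain_probability = [0.0, 0.2, 0.0, 0.9]

defaced = remove_face_voxels(intensities, face_label, brain_probability)
```

In this toy example only the first voxel is removed: the second voxel is labeled as face but has a non-zero brain probability, so it is retained.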

Clinical research outcome measurements
For all three datasets, commonly used, previously validated automated methods were used to obtain clinically relevant outcome measures on the full images (i.e., images without FFR processing) as well as on all images after FFR. In the AD dataset, NBV and unnormalized brain volume (BV) were measured with SIENAX [23]. In the MS dataset, WMLV was measured by segmenting the lesions on the FLAIR images with the lesion prediction algorithm in the lesion segmentation toolbox (LST-LPA) software [25]. In the glioblastoma dataset, the GBV was measured by taking the union of the segmentation of the glioblastoma necrotic core and enhancing tumor generated by BraTumIA [26]. A short description of the methods is provided in the supplementary data.
To provide context to any observed differences between results from full images and images after FFR, reproducibility of SIENAX, LST-LPA, and BraTumIA was assessed. This was done by repeating the analysis on 10 native images per dataset, selected based on the results from the analyses of images after FFR to include in each case 5 images with large effects of FFR and 5 images with small effects of FFR.

Statistical analyses
First, we investigated whether the FFR methods would successfully process the data, and if the automated methods could successfully process the data after FFR. A method was considered to have failed on a particular input image if the method gave an error or no output. Images were not excluded if the output quality was considered bad by human observers. The percentages of images for which the FFR methods produced output and the percentages of images for which FFR-processed images could be analyzed by the automated methods were calculated.
Next, the impact of the FFR on the outcome measures was evaluated. In order to allow a direct and fair comparison of metrics between FFR methods, only the subjects for whom all three FFR methods produced output and for which both the full images and all FFR-processed images could be analyzed by the subsequent image analysis method were included.
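The success percentages and the complete-case selection described above can be sketched with simple set operations. The subject identifiers and success sets below are invented for illustration, and each per-method set is assumed to contain the subjects for whom both the FFR method and the subsequent analysis produced output.

```python
def success_percentage(succeeded, total_subjects):
    """Percentage of subjects for which a processing step produced output."""
    return 100.0 * len(succeeded) / total_subjects


def complete_cases(full_ok, ffr_ok_by_method):
    """Subjects analyzable on the full image and after every FFR method."""
    cases = set(full_ok)
    for succeeded in ffr_ok_by_method.values():
        cases &= set(succeeded)
    return cases


# Hypothetical bookkeeping for four subjects.
all_subjects = {"s1", "s2", "s3", "s4"}
full_ok = {"s1", "s2", "s3", "s4"}
ffr_ok_by_method = {
    "QuickShear": {"s1", "s2", "s3"},
    "FaceMasking": {"s1", "s2", "s4"},
    "Defacing": {"s1", "s2", "s3", "s4"},
}

quickshear_pct = success_percentage(ffr_ok_by_method["QuickShear"], len(all_subjects))
included = complete_cases(full_ok, ffr_ok_by_method)
```

Restricting the comparison to the intersection keeps the sample identical across FFR methods, at the cost of discarding subjects on which any single method failed.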

Volumetric analyses
The effect of FFR on volumes was evaluated by assessing changes in NBV and BV (AD dataset), WMLV (MS dataset), and GBV (glioblastoma dataset) in three different ways: in data distribution, variability, and systematic differences. First, to assess data distribution, histogram characteristics (median, first and third quartiles, means, and standard deviation) were calculated for four images (1 full; 3 FFR-processed) and difference characteristics (mL and percentage difference) were calculated for 3 FFR-processed images compared with the full image, and scatter plots and Bland-Altman plots were made. Second, to assess random variability in the data, we analyzed the intra-class correlation coefficient (ICC) for absolute agreement between volumes obtained from full and FFR-processed images [27,28], reported with the lower and upper bounds of the 95% confidence interval (CI) in parentheses. Third, to assess systematic differences, two-tailed paired t tests were performed between volumes measured in full images and those obtained from each of the FFR-processed images, using a Bonferroni-corrected p = 0.05 as threshold for statistical significance.
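The two agreement statistics can be illustrated with a short standard-library sketch. The text does not state which ICC model was used, so the sketch assumes the two-way, single-measure, absolute-agreement form; the function names and toy volumes are hypothetical. The paired t statistic is computed without a p value, which would normally be obtained from the t distribution with n − 1 degrees of freedom using a statistics package and then compared against the Bonferroni-corrected threshold.

```python
from math import sqrt
from statistics import mean, stdev


def icc_absolute_agreement(full, ffr):
    """Two-way, single-measure ICC for absolute agreement (assumed form)."""
    if len(full) != len(ffr) or len(full) < 2:
        raise ValueError("need two equally long series with at least 2 subjects")
    n, k = len(full), 2  # n subjects, k = 2 measurements per subject
    rows = list(zip(full, ffr))
    grand = mean(list(full) + list(ffr))
    row_means = [mean(row) for row in rows]
    col_means = [mean(full), mean(ffr)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in rows for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def paired_t_statistic(full, ffr):
    """t statistic of the paired differences (FFR-processed minus full)."""
    diffs = [b - a for a, b in zip(full, ffr)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Toy volumes (mL) for four subjects.
full = [1.0, 2.0, 3.0, 4.0]
icc_identical = icc_absolute_agreement(full, full)
icc_offset = icc_absolute_agreement(full, [2.0, 3.0, 4.0, 5.0])
t_stat = paired_t_statistic([10.0, 12.0, 14.0, 16.0], [11.0, 12.0, 16.0, 17.0])
```

The second ICC call shows why absolute agreement is the stricter criterion: a constant offset of 1 mL leaves the ranking of subjects untouched but still lowers the ICC below 1.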

Overlap analysis (MS and glioblastoma datasets)
In MS and glioblastoma datasets, we also compared voxelwise differences between the segmentations obtained with and without FFR, because the image analysis methods used in these datasets produce location-sensitive segmentations of the structures of interest. The full dataset is used as “gold standard” and is compared with each of the three FFR-processed datasets separately, quantifying spatial agreement using Dice's similarity index (SI) [29]:
$$\mathrm{SI} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$
where TP, FP, and FN are, respectively, true positive, false positive, and false negative volumes. SI can range from 0 to 1, where SI = 0 means no overlap and SI = 1 means perfect overlap. We calculated the median and first and third quartiles of SI, FP, and FN.
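A minimal sketch of the overlap computation, assuming the standard Dice definition SI = 2TP / (2TP + FP + FN) and binary segmentations stored as flat lists of 0/1 voxel labels. The function names are hypothetical, and the value returned when both segmentations are empty is a convention chosen for this sketch.

```python
def overlap_counts(reference, test):
    """Count true positive, false positive, and false negative voxels.

    `reference` is the segmentation from the full image,
    `test` the segmentation from the FFR-processed image.
    """
    if len(reference) != len(test):
        raise ValueError("segmentations must have the same number of voxels")
    tp = sum(1 for r, t in zip(reference, test) if r and t)
    fp = sum(1 for r, t in zip(reference, test) if not r and t)
    fn = sum(1 for r, t in zip(reference, test) if r and not t)
    return tp, fp, fn


def dice_similarity(tp, fp, fn):
    """Dice's similarity index from voxel counts."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 1.0  # both segmentations empty: convention assumed here
    return 2 * tp / denominator


# Toy example with five voxels.
reference = [1, 1, 1, 0, 0]
test = [1, 1, 0, 1, 0]
tp, fp, fn = overlap_counts(reference, test)
si = dice_similarity(tp, fp, fn)
```

Voxel counts can be converted to volumes by multiplying with the voxel volume; SI itself is unaffected by that scaling.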

Failure of pipelines
A simplified flowchart summarizing the study steps is shown in Fig. 2. An overview of the percentages of images for which the FFR methods and automated methods did not fail, i.e., executed without error and with output, is shown for each dataset in Table 2.

Full image results
In all datasets, outcome measures obtained from the full images were in the expected range and showed expected distributions. The methods showed good reproducibility, as shown in Table 3.

AD dataset
Both NBV and BV were affected by FFR processing, in terms of both variability and systematic differences. Fig. 3 gives an example of SIENAX output affected by FFR processing. Figure 4 a and b show scatter plots of NBV and BV for FFR-processed images versus full images in the AD dataset; corresponding Bland-Altman plots are provided in the supplementary section. These results suggest that FFR affected NBV variability more than BV variability, which is confirmed by the ICCs (

MS dataset
For WMLV, absolute agreement between FFR-processed images and full images was high, but there were small but significant systematic differences. In Fig. 5, an example of affected lesion segmentation by FFR processing is given. Figure 4c shows the WMLV scatter plot; corresponding Bland-Altman plots are provided in the supplementary section. The corresponding ICCs in Table 5

Glioblastoma dataset
GBV appeared to be the outcome measure that is most strongly affected by FFR, with low to poor agreement and systematic volume underestimation after FFR by FaceMasking. In Fig. 6, an example of affected glioblastoma segmentation by FFR processing is given. The scatter plot in Fig. 4d, the corresponding Bland-Altman plots in the supplementary section, and the ICCs in Table 5

Overlap analysis

Glioblastoma dataset
The SI, FN, and FP of the glioblastoma segmentation are shown in

Discussion
When sharing MRI data between research institutions, it is crucial to protect the privacy of participants. In addition to removing identifying metadata from MRI, facial features should also be removed. The current study evaluated how three publicly available FFR methods affect clinically relevant imaging outcome measures in AD, MS, and glioblastoma as derived using commonly used automated methods. Our results showed that the commonly used FFR methods can lead to subsequent failures of automated volumetric pipelines. Moreover, FFR can lead to substantial changes, both random (low ICC) and systematic (significant differences), in volumes obtained by automated methods. The observed differences in outcome measures between full images and images after FFR cannot be attributed to random variation of SIENAX, LST-LPA, or BraTumIA, because the reproducibility of those methods was high. The automated methods LST-LPA for WMLV and BraTumIA for GBV failed to execute successfully on multiple FFR-processed images. It should be mentioned that we applied the automated methods with their default settings and did not attempt to remedy the errors. We did, however, assess the failures and we suspect that the failures were related to image registration steps, because registration methods can be susceptible to (disease-related) artifacts as recently highlighted by Dadar et al [30]. This recent study showed that registration used in the automated methods could have problems with higher levels of noise and non-uniformity in images and that head size could have an effect on registration methods. It is conceivable that if the face is removed or deformed, the level of noise and non-uniformity could change and lead to failures.
The possible importance of image registration in causing changes after FFR is further suggested by the higher variability of NBV compared with BV after FFR. To compute the NBV, SIENAX multiplies the BV (calculated in native subject space) by a volumetric scaling factor obtained from a linear registration of the brain image to a standard brain image, additionally using a derived skull image. FFR could affect the removal of non-brain tissue and identification of the skull, and thereby cause a different registration result, culminating in altered NBV values.

[…] for the pairwise comparison of volumes from FFR-processed and full images; volume differences between the full and FFR-processed images as mL and % differences (median [1st and 3rd quartiles]); and ICC (absolute agreement (lower-upper bound of 95% CI)) between volumes from full and FFR-processed images. Table header: AD (n = 110), normalized brain volume. Only the images for which FFR was successful and for which the segmentation was successful are included. ICC, intra-class correlation coefficient; n, number of subjects; AD, Alzheimer's disease.

Table 5 Lesion volume in the MS dataset and tumor volume in the glioblastoma dataset. From left to right, the table lists median [1st and 3rd quartiles] for volumes; Bonferroni-corrected p values for the pairwise comparison of volumes from FFR-processed and full images; volume differences between the full and FFR-processed images as mL and % differences (median [1st and 3rd quartiles]); ICC (absolute agreement (lower-upper bound of 95% CI)) between volumes from full and FFR-processed images; Dice's similarity index between segmentation from full and FFR-processed images; false negative between segmentations from full and FFR-processed images; and false positive between segmentations from full and FFR-processed images. Table header: MS, lesion (n = 55), volume (mL).
Differences in shapes of the head and face between people (related to, e.g., sex or ethnicity) may affect performance of standard FFR algorithms, which may have ramifications especially for subsequent analysis methods that use the skull, such as SIENAX. However, there are also cases in which the NBV was not affected by FFR, which suggests that there may be a limit to how much of the head can be removed without affecting the NBV measurement. In a further study, an optimum could be determined between the amount of facial features that should be removed for de-identification and the amount that should remain for correct analyses with automated methods. Therefore, it would be interesting to study in more detail the effect of FFR methods on registration and other processing steps in a systematic way. Milchenko and Marcus [13] and Bischoff-Grethe et al [14] both addressed the effects of their FFR method on skull stripping; however, it would be interesting to analyze the effects of these methods on other processing steps as well as multiple skull stripping methods, all in the same dataset for an objective comparison. Next, to remedy those errors, analysis methods and processing steps should be made robust against the absence or distortion of facial features. As an example, facial features could be removed from fixed reference images in registration steps or in reference templates (e.g., of tissue probabilities) in image processing pipelines. Moreover, it would also be helpful to study if changing the default settings of the automated methods improves the segmentation.
The measured volumes of the automated methods are affected both randomly (low ICC) and systematically (significant differences). The random effects are mostly visible in the volume change of the NBV and GBV after FFR processing and the significant differences are visible in the volume change of BV and GBV. The Bland-Altman plots show that the volume changes are not dependent on the measured volumes. FFR affects not only the measured volumes of the automated methods but also the extent and precise spatial location of the WML and the glioblastomas, as demonstrated by the overlap analyses. For the WML in MS, median FP and FN fractions ranged between about 10 and 25% of the median total WMLV. Similar effects were observed for glioblastoma, with median FN fractions between about 15 and 20%, and median FP fractions between about 6 and 9%. Both the volumetric and spatial results indicate that the differences between full image segmentations and FFR-processed segmentations are substantial. This is unexpected, especially for WMLV and GBV, given that both the MS lesions and the glioblastoma are located within the region occupied by brain tissue that should not be, and judging from our visual inspections indeed was not, affected by the FFR methods. Both in MS and glioblastoma, the exact location and extent of pathological changes are of importance; therefore, these artifactual post-FFR segmentation changes should be investigated in more detail and methods should be devised and tested to mitigate these effects.
Our results showed that the effects of FFR on current methods are a common problem across domains: brain volumes, MS lesion volumes, and glioblastoma volumes were all to some degree affected by FFR. The next step would be to investigate how to overcome such issues for SIENAX, LST-LPA, and BraTumIA, or in a broader sense, to study and mitigate sources of error after FFR for multiple methods aimed at brain volume, MS lesion, or glioblastoma segmentation. Another option would be to consider removing the facial features from fixed reference images in registration steps or in reference templates (e.g., of tissue probabilities) in image processing pipelines.
It should be noted that in this study, we did not test if the FFR methods indeed made the participant unrecognizable. However, we observed that the FFR methods in some cases seemed to leave parts of the face intact. We did not assess whether this made the person recognizable, because this would require a more rigorous setup outside the scope of this study. However, it would be important to establish guidelines on how to make participants unrecognizable, specifically which parts of the face should be removed or otherwise processed to ensure participants' privacy. Moreover, for protecting participants' privacy, it may be important to take into account that reconstruction of removed or deformed facial features may be possible [31].
In conclusion, this study highlighted a new challenge to the neuroimaging research community, which is to ensure high-quality neuroimaging research while protecting participants' privacy. Our results demonstrate that facial features removal of brain MRI can lead both to failure of automated analysis methods (mostly by LST-LPA and BraTumIA) and to changes in volumes obtained by the analysis methods, including both “random” variation (mostly by NBV and GBV) and systematic differences (mostly by BV and GBV). Therefore, volumetric image analysis methods need to be carefully assessed and optimized with regard to FFR methods, in order to ensure the reliability of clinical research outcomes while protecting participants' privacy in multicentric, collaborative studies. This could be done by improving image registration accuracy after FFR, addressing in more detail the effect of FFR methods on other processing steps or by developing methods that are tailored to images from which facial features have been removed.