A self-supervised learning strategy for postoperative brain cavity segmentation simulating resections

Purpose Accurate segmentation of brain resection cavities (RCs) aids in postoperative analysis and determining follow-up treatment. Convolutional neural networks (CNNs) are the state-of-the-art image segmentation technique, but require large annotated datasets for training. Annotation of 3D medical images is time-consuming, requires highly trained raters and may suffer from high inter-rater variability. Self-supervised learning strategies can leverage unlabeled data for training. Methods We developed an algorithm to simulate resections from preoperative magnetic resonance images (MRIs). We performed self-supervised training of a 3D CNN for RC segmentation using our simulation method. We curated EPISURG, a dataset comprising 430 postoperative and 268 preoperative MRIs from 430 refractory epilepsy patients who underwent resective neurosurgery. We fine-tuned our model on three small annotated datasets from different institutions and on the annotated images in EPISURG, comprising 20, 33, 19 and 133 subjects. Results The model trained on data with simulated resections obtained median (interquartile range) Dice score coefficients (DSCs) of 81.7 (16.4), 82.4 (36.4), 74.9 (24.2) and 80.5 (18.7) for each of the four datasets. After fine-tuning, DSCs were 89.2 (13.3), 84.1 (19.8), 80.2 (20.1) and 85.2 (10.8). For comparison, inter-rater agreement between human annotators from our previous study was 84.0 (9.9). Conclusion We present a self-supervised learning strategy for 3D CNNs using simulated RCs to accurately segment real RCs on postoperative MRI. Our method generalizes well to data from different institutions, pathologies and modalities. Source code, segmentation models and the EPISURG dataset are available at https://github.com/fepegar/resseg-ijcars.


Motivation
Approximately one-third of epilepsy patients are drugresistant. If the epileptogenic zone (EZ), i.e., "the area of cortex indispensable for the generation of clinical seizures" [26], can be localized, resective surgery to remove the EZ may be curative. Currently, 40% to 70% of patients with refractory focal epilepsy are seizure-free after surgery [16]. This is, in part, due to limitations identifying the EZ. Retrospective studies relating presurgical clinical features and resected brain structures to surgical outcome provide useful insight to guide EZ resection [16]. To quantify resected structures, first, the resection cavity (RC) must be segmented on the postoperative magnetic resonance image (MRI). A preoperative image with a corresponding brain parcellation can then be registered to the postoperative MRI to identify resected structures.
RC segmentation is also necessary in other applications. For neuro-oncology, the gross tumor volume, which is the sum of the RC and residual and residual tumor volumes, is estimated for postoperative radiotherapy [10].
Despite recent efforts to segment RCs in the context of brain cancer [10,18], little research has been published in the context of epilepsy surgery. Furthermore, previous work is limited by the lack of benchmark datasets, released code or trained models, and evaluation is restricted to singleinstitution datasets used for both training and testing.

Related works
After surgery, RCs fill with cerebrospinal fluid (CSF). This causes an inherent uncertainty in delineating RCs adjacent to structures such as sulci, ventricles or edemas. Nonlinear registration has been presented to segment the RC for epilepsy [6] and brain tumor [4] surgeries by detecting non-corresponding regions between pre-and postoperative images. However, evaluation of these methods was restricted to a very small number of images. Furthermore, in cases with intensity changes due to the resection (e.g., brain shift, atrophy, fluid filling), non-corresponding voxels may not correspond to the RC.
Decision forests were presented for brain cavity segmentation after glioblastoma surgery, using four MRI modalities [18]. These methods, which aggregate hand-crafted features extracted from all modalities to train a classifier, can be sensitive to signal inhomogeneity and unable to distinguish regions with intensity patterns similar to CSF from RCs. Recently, a 2D convolutional neural network (CNN) was trained to segment the RC on MRI slices in 30 glioblastoma patients [10]. They obtained a 'median (interquartile range)' Dice score coefficient (DSC) of 84 (10) compared to ground-truth labels by averaging predictions across anatomical axes to compute the 3D segmentation. While these approaches require four modalities to segment the resection cavity, some of the modalities are often unavailable in clinical settings [9]. Furthermore, code and datasets are not publicly available, hindering a fair comparison across methods. Applying these techniques requires curating a dataset with manually obtained annotations to train the models, which is expensive.
Unsupervised learning methods can leverage large, unlabeled medical image datasets during training. In selfsupervised learning, training instances are generated automatically from unlabeled data and used to train a model to perform a pretext task. The model can be fine-tuned on a smaller labeled dataset to perform a downstream task [5]. The pretext and downstream tasks may be the same. For example, a CNN was trained to reconstruct a skull bone flap by simulating craniectomies on CT scans [17]. Lesions simulated in chest CT of healthy subjects were used to train models for nodule detection, improving accuracy compared to training on a smaller dataset of real lesions [25].

Contributions
We present a self-supervised learning approach to train a 3D CNN to segment brain RCs from T 1 -weighted (T 1 w) MRI without annotated data, by simulating resections during training. We ensure our work is reproducible by releasing the source code for resection simulation and CNN training, the trained CNN and the evaluation dataset. To the best of our knowledge, we introduce the first open annotated dataset of postoperative MRI for epilepsy surgery.
This work extends our conference paper [22] as follows: (1) we performed a more comprehensive evaluation, assessing the effect of the resection simulation shape on performance and evaluating datasets from different institutions and pathologies; (2) we formalized our transfer learning strategy.

Problem statement
The overall objective is to automatically segment RCs from postoperative T 1 w MRI using a CNN f θ parameterized by weights θ . Let X post : Ω → R and Y cavity : Ω → {0, 1} be a postoperative T 1 w MRI and its cavity segmentation label, respectively, where Ω ⊂ R 3 . X post and Y cavity are drawn from the data distribution D post . In model training, the aim is to minimize the expected discrepancy between the label Y cavity and network prediction f θ (X post ). Let L be a loss function that estimates this discrepancy (e.g., Dice loss). The optimization problem for the network parameters θ is: In a fully supervised setting, a labeled dataset is employed to estimate the expectation defined in (1) as: In practice, CNNs typically require an annotated dataset with a large n post to generalize well for unseen instances. However, given the time and expertise required to annotate scans, n post is often small. We present a method to artificially increase n post by simulating postoperative MRIs and associated labels from preoperative scans.

Simulation for domain adaptation and self-supervised learning
be a dataset of preoperative T 1 w MRI, drawn from the data distribution D pre . We propose to generate a simulated postoperative dataset D sim = {(X sim i , Y sim i )} n sim i=1 using the preoperative dataset D pre . Specifically, we aim to build a generative model φ sim : X pre → (X sim , Y sim ) that transforms preoperative images into simulated, annotated postoperative images that imitate instances drawn from the postoperative data distribution D post . D sim can then be used to estimate the expectation in (1): Simulated images can be generated from any unlabeled preoperative dataset. Therefore, the size of the simulated dataset can be much greater than the annotated dataset D post , i.e., n sim n post . The network parameters θ can be optimized by minimizing (3) using stochastic gradient descent, leading to a trained predictive function f θ sim . Finally, f θ sim can be fine-tuned on D post to improve performance on the postoperative domain D post .
Resection simulation for self-supervised learning φ sim takes images from D pre to generate training instances by simulating a realistic shape, location and intensity pattern for the RC. We present simulation of cavity shape and label in sections "Initial cavity shape" and "Cavity label", respectively. In section "Simulating cavities filled with CSF", we present our method to generate the resected image.

Initial cavity shape
To simulate a realistic RC, we consider its topological and geometric properties: it is a single volume with a non-smooth boundary. We generate a geodesic polyhedron with frequency f by subdividing the edges of an icosahedron f times and projecting each vertex onto a parametric sphere with a unit radius centered at the origin. This polyhedron models a spher- and faces F = f k ∈ N 3 n F k=1 , where n V and n F are the number of vertices and faces, respectively. Each face To create a non-smooth surface, S is perturbed with simplex noise [24], a procedural noise generated by interpolating pseudorandom gradients on a multidimensional simplicial grid. We chose simplex noise as it simulates natural-looking textures or terrains and is computationally efficient for multiple dimensions. The noise η : is a weighted sum of the noise contribution for ω different octaves, with weights {γ n−1 } ω n=1 controlled by the persistence parameter γ . The displacement δ of a vertex v is: where ζ is a scaling parameter to control smoothness and μ is a shifting parameter that adds stochasticity (equivalent to a random number generator seed). Each vertex v i is displaced radially to create a perturbed sphere: Next, a series of transforms is applied to V δ to modify the mesh's volume and shape. To add stochasticity, random rotations around each axis are applied to V δ with the rota- is a scaling transform, where (r 1 , r 2 , r 3 ) = r are semiaxes of an ellipsoid with volume v used to model the cavity shape. The semiaxes are computed as r 1 = r , r 2 = λr and r 3 = r /λ, where r = (3v/4) 1/3 and λ controls the semiaxes length ratios. 1 These transforms are applied to V δ to define the initial resection cavity surface

Cavity label
The simulated RC should not span both hemispheres or include extracerebral tissues such as bone or scalp. This section describes our method to ensure that the RC appears in anatomically plausible regions. A T 1 w MRI is defined as X pre : Ω → R. A full brain parcellation P : Ω → Z is generated [3] for X pre , where Z is the set of segmented structures. A cortical gray matter where represents the Hadamard product. Fig. 1 illustrates the process.

Simulating cavities filled with CSF
Brain RCs are typically filled with CSF. To generate a realistic CSF texture, we create a ventricle mask M V : Ω → {0, 1} from P, such that M V ( p) = 1 for all p within the ventricles and M V ( p) = 0 outside. Intensity values within the ventricles are assumed to have a normal distribution [14] with a mean μ CSF and standard deviation σ CSF calculated from voxel intensity values in We use Y sim to guide blending of X CSF and X pre as follows. A Gaussian filter is applied to Y sim to obtain a smooth alpha channel where * is the convolution operator and G N (σ ) is a 3D Gaussian kernel with standard deviations σ = (σ x , σ y , σ z ). Then, X CSF and X pre are blended by the convex combination We use σ > 0 to mimic partial-volume effects at the cavity boundary. The blending process is illustrated in Fig. 2.

Data
Public data for simulation T 1 w MRIs were collected from publicly available datasets Information eXtraction from Images (IXI), Alzheimer's Disease (AD) Neuroimaging Initiative (ADNI) and Open Access Series of Imaging Studies (OASIS), for a total of 1813 images. They are used as control subjects in our self-supervised experiments (section "Simulation for domain adaptation and self-supervised learning"). Note that, although we use the term "control" to refer to subjects that have not undergone resective surgery, they may have other neurological conditions. For example, subjects in ADNI may suffer from AD.

Multicenter epilepsy data
We evaluate the generalizability of our approach to data from several institutions: Milan (n = 20), Paris (n = 19), Strasbourg (n = 33) and EPISURG (n = 133). We curated the EPISURG dataset from patients with refractory focal epilepsy who underwent resective surgery between 1990 and 2018 at the National Hospital for Neurology and Neurosurgery (NHNN), London, United Kingdom. All images in EPISURG were defaced using a predefined face mask in the Montreal Neurological Institute (MNI) space to preserve patient identity. In total, there were 430 patients with postoperative T 1 w MRI, 268 of which had a corresponding preoperative MRI. EPISURG is available online and can be freely downloaded [21]. The same human rater (F.P.G.) annotated all images semi-automatically using 3D Slicer 4.10 [11].

Brain tumor datasets
The Brain Images of Tumors for Evaluation (BITE) dataset [19] consists of ultrasound and MRI of patients with brain tumors. We use 13 postoperative T 1 w MRIs with gadolinium contrast enhancement (T 1 wCE) to perform a qualitative assessment of our model's generalization to images from a substantially different domain (contrast-enhanced images) and different pathology, where different surgical techniques may affect RC appearance.

Preprocessing
For all images, the brain was segmented using ROBEX [15]. They were resampled into the MNI space using sinc interpolation to preserve image quality. After resampling, all images had a 1-mm isotropic resolution and size 193 × 229 × 193.

Network architecture and implementation details
We used the PyTorch deep learning framework, training with automatic mixed precision (AMP) on two 32-GB TESLA V100 GPUs. We used Sacred [13] to configure, log and visualize experiments. We implemented a 3D U-Net [7] variant using two contractive and expansive blocks, upsampling with trilinear interpolation for the synthesis path and 1/4 of the filters for each convolutional layer. We used dilated convolutions, starting with a dilation factor of one, then increased or decreased in steps of one after each contractive or expansive block, respectively. Our architecture has the same receptive field (88 mm 3 ) but ≈ 77× fewer parameters (246,156) than the original 3D U-Net, reducing overfitting and computational burden.
Convolutional layers were initialized using He's method, and followed by batch normalization and nonlinear PReLU activation functions. We used adaptive moment estimation (AdamW) to adjust the learning rate, with weight decay of 10 −2 , and a learning scheduler that divides the learning rate by ten every 20 epochs. We optimized our network to minimize the mean soft Dice loss of each mini-batch. For training, a mini-batch size of ten images (five per GPU) was used. Self-supervised training took approximately 27 h. Finetuning on a small annotated dataset took approximately 7 h.

Processing during training
We use TorchIO transforms to load, preprocess and augment our data during training [23]. Instead of preprocessing images with denoising or bias removal, we simulate different artifacts in the training instances so that our models are robust to artifacts. Our preprocessing and augmentation transforms are: (1) random simulation (RS) of resections (self-supervised training only), (2) histogram standardization, (3) Gaussian blurring or RS of anisotropic spacing, (4) RS of MRI ghosting, (5) spike and (6) motion artifacts, (7) RS of bias field inhomogeneity, (8) standardization of foreground to zero-mean and unit variance, (9) Gaussian noise, (10) RS of affine or free-form transformations, (11) random flip around the sagittal plane and (12) crop to a tight bounding box around the brain. We refer the reader to our GitHub repository for details.

Experiments
Overlap measurements are reported as 'median (interquartile range)' DSC. No postprocessing is performed for evaluation, except thresholding at 0.5. We analyzed differences in model performance using a one-tailed Mann-Whitney U test (as DSCs were not normally distributed) with a significance threshold of α = 0.05 and Bonferroni correction for n experiments: α Bonf = α n(n−1) .

Fig. 3
Simulation of RCs with increasing shape complexity (section "Resection simulation for self-supervised learning"): cuboid (a), ellipsoid (b) and ellipsoid perturbed with simplex noise (c)

Self-supervised learning: training with simulated resections only
In our first experiment, we assess the relation between the resection simulation complexity and the segmentation performance of the model. We train our model with simulated resections on the publicly available dataset , where n pre = 1813 (section "Data"). We use 90% of the images in D pre for the training set D pre,train and 10% for the validation set. At each training iteration, b images from D pre,train are loaded, resected, preprocessed and augmented to obtain a mini-batch of b training instances Note that the resection simulation is performed on the fly, which ensures that the network never sees the same resection during training. Models were trained for 60 epochs, using an initial learning rate of 10 −3 . We use the model weights from the epoch with the lowest mean validation loss obtained during training for evaluation. Models were tested on the 133 annotated images in EPISURG.
To investigate the effect of the simulated cavity shape on model performance, we modify φ sim to generate cuboidshaped (Fig. 3a) or ellipsoid-shaped (Fig. 3b) resections and compare with the baseline "noisy" ellipsoid (Fig. 3c). The cuboids and ellipsoid meshes are not perturbed using simplex noise, and cuboids are not rotated.

Fine-tuning on small clinical datasets
We assessed the generalizability of our baseline model by fine-tuning it on small datasets from four institutions that may use different surgical approaches and acquisition protocols (including contrast enhancement and anisotropic spacing in Strasbourg) (section "Multicenter epilepsy data"). Additionally, we fine-tuned the model on 20 cases from EPISURG with the lowest DSC in section "Self-supervised learning: training with simulated resections only".
For each dataset, we load the pretrained baseline model, initialize the optimizer with an initial learning rate of 5 × 10 −4 , initialize the learning rate scheduler and fine-tune all layers simultaneously for 40 epochs using 5-fold crossvalidation. We use model weights from the epoch with the lowest mean validation loss for evaluation. To minimize data leakage, we determined the above hyperparameters using the validation set of one fold in the Milan dataset.
We observed a consistent increase in DSC for all finetuned models, up to a maximum of 89.2 (13.3) for the Milan dataset. For comparison, inter-rater agreement between human annotators in our previous study was 84.0 (9.9) [22]. Quantitative evaluation is illustrated in Fig. 4.

Qualitative evaluation on brain tumor resection dataset
We used the BITE dataset [19] to evaluate the ability of our self-supervised model to segment RCs on images from a different institution, modality and pathology than the datasets used for quantitative evaluation. For postprocessing, all but the largest binary connected component were removed. The model successfully segmented the RC on 11/13 images, even though some contained challenging features (Fig. 5).

Qualitative evaluation on intraoperative image
We used our baseline model to segment the RC on one intraoperative MRI from our institution. Despite the large domain shift between the training dataset and the intraoperative image, which includes a retracted skin flap and a missing bone flap, the model was able to correctly estimate the RC, discarding similar regions filled with CSF or air (Fig. 6).

Discussion and conclusion
We addressed the challenge of segmenting postoperative brain resection cavities from T 1 w MRI without annotated data. We developed a self-supervised learning strategy to train without manually annotated data and a method to simulate RCs from preoperative MRI to generate training data. Our novel approach is conceptually simple, easy to imple- Fig. 4 DSC without (blue) and with (orange) fine-tuning of the model training using self-supervision. Horizontal lines in the boxes represent the first, second (median) and third quartiles. EPISURG (worst) comprises the 20 cases from EPISURG with the lowest DSC in the experiment described in section "Self-supervised learning: training with simulated resections only". Numbers in parentheses indicate subjects per dataset

Fig. 5
Qualitative results on postoperative brain tumor T 1 wCE MRI. The model is robust to: air and CSF in the RC (a), anisotropic spacing (b), presence of edema (c) and a different modality than used for training (all). Note that these images are from a different institution, modality and pathology than the datasets used for quantitative evaluation. Manual annotations are not available The resection simulation is computationally efficient (< 1 s), so it can run during training as part of a data augmentation pipeline. It is implemented within the TorchIO framework [23] to leverage other data argumentation techniques during training, enabling our model to have a robust performance across MRI of variable quality.
Modeling a realistic cavity shape is important (section "Self-supervised learning: training with simulated resections only"). Our model generalizes well to clinical data from different institutions and pathologies, including epilepsy and glioma. Models may be easily fine-tuned using small annotated clinical datasets to improve performance. Moreover, our resection simulation and learning strategy may be extended to train with arbitrary modalities, or synthetic modalities generated from brain parcellations [1]. Therefore, our strategy can be adopted by institutions with a large amount of unla-beled data, while fine-tuning and testing on a smaller labeled dataset.
Poor segmentation performance is often due to very small cavities, where the cavity was not detected, and large brain shift or subdural edema, where regions were incorrectly segmented. The former issue may be overcome by training with a distribution of cavity volumes which oversamples small resections. The latter can be addressed by extending our method to simulate displacement with biomechanical models or nonlinear deformations of the brain [12].
We showed that our model correctly segmented an intraoperative image, respecting imaginary boundaries between brain and skull, suggesting a good inductive bias of human neuroanatomy. Qualitative results and execution time, which is in the order of milliseconds, suggest that our method could be used intraoperatively, for image guidance during resection or to improve registration with preoperative images by masking the cost function using the RC segmentation [2].
Segmenting the RC may also be used to study potential damage to white matter tracts postoperatively [27]. Our method could be easily adapted to simulate other lesions for selfsupervised training, such as cerebral microbleeds [8], narrow and snake-shaped RCs typical of disconnective surgeries [20] or RCs with residual tumor [18].
As part of this work, we curated and released EPISURG, an MRI dataset with annotations from three independent raters. EPISURG could serve as a benchmark dataset for quantitative analysis of pre-and postoperative imaging of open resection for epilepsy treatment. To the best of our knowledge, this is the first open annotated database of postresection MRI for epilepsy patients.