Back to the Roots: Reconstructing Large and Complex Cranial Defects using an Image-based Statistical Shape Model

Designing implants for large and complex cranial defects is a challenging task, even for professional designers. Current efforts on automating the design process focused mainly on convolutional neural networks (CNN), which have produced state-of-the-art results on reconstructing synthetic defects. However, existing CNN-based methods have been difficult to translate to clinical practice in cranioplasty, as their performance on large and complex cranial defects remains unsatisfactory. In this paper, we present a statistical shape model (SSM) built directly on the segmentation masks of the skulls represented as binary voxel occupancy grids and evaluate it on several cranial implant design datasets. Results show that, while CNN-based approaches outperform the SSM on synthetic defects, they are inferior to SSM when it comes to large, complex and real-world defects. Experienced neurosurgeons evaluate the implants generated by the SSM to be feasible for clinical use after minor manual corrections. Datasets and the SSM model are publicly available at https://github.com/Jianningli/ssm.


I. INTRODUCTION
Before deep learning gained wide popularity [1], statistical shape model (SSM) and its variants (e.g., active shape models (ASMs) [2], active blobs [3] and active appearance models (AAMs) [4]) were broadly adopted in medical reconstruction and segmentation tasks, such as the reconstruction of craniofacial defects [5][6][7][8][9][10] and human rib cage [11], and the segmentation of hip joints [12] and organs [13].In contrast to the latent shape features learned by deep neural nets, which are difficult to interpret, SSM offers the option to express a shape in an explicit manner, by linearly combining the mean shape and the principal modes of shape variations of a given shape pool.Surface meshes were common choices for anatomical shape representation in many SSM-based studies.Establishing dense point correspondence among the meshes is deemed the most demanding part in building an SSM, especially when the medical images, from which the meshes are derived, are of high resolution [14].Methods that establish points correspondence automatically are typically based on a mesh-to-mesh registration procedure (e.g., Iterative Closest Point [15]), where the meshes are registered to a reference mesh through a similarity transformation (scaling, rotation and translation).However, popular state of the art segmentation methods and various medical applications are image-based [16][17][18][19][20].It is therefore desirable to circumvent the imageto-mesh conversion procedure and build an SSM directly on images [21][22][23].
Automatic cranial implant design is another typical application that uses images as the initial shape representation [24].Existing deep learning-based methods usually train a deep neural net on hundreds of skull images with either synthetic defects [25][26][27][28] or clinical defects [29], depending on the availability of the clinical images.These approaches are data-and computation-intensive, and most importantly, the quality of their reconstructions for large and complex defects, which are common in cranioplasty, remains inadequate for clinical use [30,31].The failure can largely be attributed to domain shift: the synthetic defects in the training set have different distribution to that of the test set.Augmenting the training set intensively is a potentially practical and effective solution to the problem [32].However, current development in data augmentation-enhanced deep learning is still evaluated as substandard by clinical experts [30,32].
A method that is independent from and insensitive to the defects may optimally avoid the domain shift problem in cranial defect reconstruction.To this end, we propose an SSMbased method for automatic cranial implant design.Unlike previous mesh-based SSM for craniofacial defect reconstruction [6,7,9], our SSM is built directly on the volumetric skull images.We show that dense point correspondence among the training skull images can simply be achieved through an image registration and warping step, and the mean shape and shape variations of the skulls can be calculated thereafter.Besides the expected robustness against large and complex cranial defects, another favorable property of an SSM-based method is that the skull shapes can be expressed mathematically and explicitly, unlike deep learning-based approaches that try to learn an implicit and latent shape representation.
The proposed SSM is evaluated on both synthetic defects and irregularly shaped clinical defects from three cranial implant design tasks.We show that even though deep learning still beats SSM on synthetic defect reconstruction, the performance of SSM is superior and stable when it comes to large and complex clinical defects.

A. Overview
The main workload of building an SSM lies in finding the mean shape S and the primary shape variations φ of a given shape pool, as specified by Equation 1: λ i is the weight of variation φ i .C is the number of shape variations chosen for reconstructing the new shape S. Let X be a shape pool containing C binary volumetric images x i ∈ {0, 1} Di , in which the j th image x j (0 < i, j C) is selected as a reference.D i is the image dimension.The non-zero voxels in these images constitute the geometry of the shapes and are regarded by default as the shape landmarks.Therefore, establishing the point correspondences between all the images in the shape pool can be achieved by simply registering x i (i = j) to the reference image x j : Tr is a transformation that warps the images in X into the space of x j : be the set of warped shapes.The mean shape S ∈ R Dj of X is calculated as: To extract the shape variations φ i ∈ R Dj , principal component analysis (PCA) [33] ) be the set of chosen shape variations.Transforming X into the PCA space can be achieved via: X pca ∈ R C×C .The X pca is given by the PCA function from the scikit-learn package and X −1 is computed from the training set X. Therefore, we can calculate the variation matrix Φ as follows 1 : X −1 is a pseudo inverse of X .Given a test shape y, it is first registered to the reference image y = Tr(y), y ∈ R Dj and then mapped into the PCA space defined by the shape pool X: We rescale λ i to [0,1] via: (λ i − min(λ))/max(λ) − min(λ).Given λ, Φ and S, the new shape can be computed according to Equation 1.In our work, Tr is chosen to be a similarity transformation.The reconstructed shapes can be warped to their original space via an inverse transformation Tr −1 .

B. Volumetric Shape Completion
1) Shape Warping: An intuitive way to complete a defective shape y is to warp it to the space of a complete shape x j , which can be achieved through a registration process (i.e., y = Tr(y)).Since the anatomical landmarks of the two shapes are aligned because of registration, a following subtraction operation2 between the two shapes can yield the missing portion of the defective shape y m : The addition of y and y m produces the complete shape corresponding to y .By inverting the registration, we can The concept is similar to that of a template-based shape completion approach [34], in which the missing part of a defective shape is taken from a complete template shape.The choice of the template shape affects the authenticity of y m .Optimally, a shape x j that is general and representative of the shape class should be chosen as the template, to ensure that the registration error between x j and y is small and the missing part is taken from anatomically close regions on the template.The template shape can be from a single image like x j , or the mean shape S of a shape pool, as specified in Equation 3.
2) SSM for Volumetric Shape Completion: If the shape pool X consists of complete shapes, while the test shape y refers to a defective shape, applying Equation 1 -Equation 6would give the complete counterpart corresponding to y.In this sense, SSM can be used for shape completion tasks.In [35], the authors used PCA for skull shape completion and showed that, by applying PCA and an inverse PCA consecutively to a defective skull, a complete skull can be obtained.Equation 1 and Equation 6give the mathematical explanation: the PCA computes the skull shape variations Φ from the training samples and the weights λ from the warped defective skull y , while the inverse PCA, according to the the implementation of inverse_transform from the scikit-learn package, computes: which is equivalent to Equation 1. Incorporating Equation 6into Equation 9we get: In both [35] and Equation 1, the principal components of a defective skull are used as the weights of the shape variations.
An obvious shortcoming is that, if the defects are too large, the principal components computed from a defective skull might not reflect the true distribution of the shape variations of a complete skull.For example, given a defective skull whose facial bone far outweighs the cranium due to a large cranial defect, the weight λ k of the variation concerning the facial area φ k (0 < k C) would overwhelm the other variations, resulting in an inappropriate reconstruction of the region of interest (ROI, i.e., the cranium) for the cranial implant design task.To address this problem, we first use a shape completion network (denoted as DL) 3 to generate the complete counterpart of the defective skull y and use the completed skull to calculate λ: However, whether using Equation 11for weight calculation has an positive effect on the results than using Equation 6depends on the quality of the DL reconstructions.

A. Dataset and Metrics
We evaluated our method on three datasets: the 11 clinical cases of defective skulls from Tasks 2 of the AutoImplant II challenge 4 , the 29 craniotomy skulls from MUG500+ [37], and the 110 test skulls with synthetic defects from Task 3 of AutoImplant II.To conform to the evaluation scheme of the AutoImplant II challenge, we measure the reconstruction accuracy using dice similarity coefficient (DSC), border DSC and 95 percentile Hausdorff distance (HD95).
The complete skulls from the training set of Task 3 were used as the shape pool X.The image dimension is

B. Reconstruction of Synthetic Cranial Defects
The 110 test skulls from Task 3 contain synthetic defects.In this experiment, we evaluate different methods of creating a skull template for shape warping: averaging 30 complete skulls ( S (30)), averaging 50 complete skulls ( S (50)) and using only a single skull (x j ).The 30 skulls are a subset of the 50 skulls.We also evaluate how the weights of shape variations λ affect the reconstruction accuracy of SSM (Equation 1): λ computed via Equation 6(SSM (30)) and λ computed via Equation 11(SSM (30) + DL).The CNN-based shape completion network -DL is taken from [36,27] 6 .Besides, shape reconstruction using only the shape variations is also evaluated: For the former, the λ i is set to one.For the latter, λ i is calculated regularly based on Equation 6.The DSC, bDSC and HD95 for these shape completion methods are reported in Table I and Figure 3.The results bear important implications: (1) Using a single skull image x j as the template for shape warping can produce reasonable results, qualitatively and quantitatively (third row, Table I, and Figure 2 E).However, it should be noted that the conclusion applies only to the cranial implant design task, where the ROI to reconstruct is the structurally uncomplicated cranium.The single image-based template would likely to fail for reconstructing individual-specific facial bones, which contain complex and subtle structures.(2) Shape template derived from a single shape (x j ) produces comparable results to that from a mean shape averaged from 30 ( S (30)) or 50 ( S (50)) images.Figure 2 gives an example of the results obtained using shape warping.We can see that S(30) (Figure 2 B) shows no noticeable difference on the cranium compared  to x j (Figure 2 D).As a result, subtracting the input from the templates (Equation 7) produces similar implants.The main difference lies in the facial area and the interior subtle structures (Figure 2 C and E).Applicable to both statements (1) and ( 2), the reconstruction accuracy depends largely on how well the target matches with the template on the ROI (e.g., cranium or facial bones) during the warping and registration process.It is relatively easier to register among different craniums than different facial bones.For a facial reconstruction task (e.g., facial implant design), a mean shape, as illustrated in Figure 2 (B), possesses significantly more facial landmarks than a single image, which might potentially make the facial registration more accurate.(3) The weights of the shape variations affect the accuracy of cranium reconstruction.An analysis of the shape variation matrix Φ reveals that, around 53% of the C = 30 variations are related to the full skull and the remaining are associated with the facial area only, which do not contribute to the cranium reconstruction.The weight distribution (calculated based on Equation 6) of d0 i=1 λ i Φ i is shown in Figure 4. We can see that the largest three weights in d0 i=1 λ i Φ i correspond to the i = 2, 5, 7 variations: φ 2 (facial bones), φ 5 (full skull) and φ 7 (full skull).The results presented in the d0 i=1 λ i Φ i (λ i = 1) and d0 i=1 λ i Φ i rows of Table I only show marginal differences, meaning that the two full skull-related variations (φ 5 and φ 7 ) already carries the information necessary for reconstructing a complete skull.Figure 4 also shows the weight distribution λ calculated based on Equation 11.The variations corresponding to the three largest weights are φ 2 (facial bones), φ 5 (full skull) and φ 8 (full skull).Compared to 'SSM (30)', the deviation in weight distribution for 'SSM (30) + DL' leads to marginal decline in the quantitative scores, meaning that a better network or a training method should be chosen in order that the DL can benefit 'SSM (30)'.(4) In comparison to the deep learningbased approaches [38,32], the shape warping-and SSMbased methods achieve inferior results on synthetic defects quantitatively.However, it should be noted that both [38] and [32] used an intensively augmented dataset during training, while only 30 images were used to build the SSM.

C. MUG500+
This section presents the implant generation results on the craniotomy skulls from the MUG500+ dataset [37].Figure 5 shows how to manually post-process an implant (Figure 5 (A)) generated by subtracting the input defective skull from the skull reconstructed by SSM (30).First, a median smoothing filter is applied to the subtraction result to (partly) disconnect the implant from the noise (Figure 5 B).The smoothing kernel size should be chosen individually.Second, the scissors  functionality is used to erase the delineated area (Figure 5 C) to fully remove the noise and extract the implant.Figure 5 (D) shows the post-processing result.
Step (B) and (C) is done manually using 3D Slicer (https://www.slicer.org/)[39].Alternatively, the implant can be extracted automatically by applying morphological opening and connected component analysis consecutively to the subtraction result.However, it is recommended to follow the manual post-processing workflow in Figure 5 for an optimal outcome 7 .Figure 6 presents the automatically generated implants for some large and complex defects in the MUG500+ dataset.The implants are generated by S (50) and manually post-processed.We can see that some of the defects are large enough to cover half of the cranium or have rather irregular shapes.Nonetheless, the defects are still satisfactorily reconstructed.Notably, Figure 1 (the teaser image) shows that the method is still effective when multiple large defects exit on the craniotomy skull.The completed skulls obtained according to Equation 8 preserve the anatomical aesthetics of normal human skulls.The MUG500+ dataset contains the manually designed implants (i.e., surface models in .stlformat) for the 29 craniotomy skulls.The manual designs are converted to images (.nrrd) for a quantitative comparison with our automatic designs.The results are given in Table II.Keep in mind that, while interpreting the scores, the quantitative results are only a reflection of how well our automatic reconstructions match with the manual designs, which is subjective and experience-dependent 8 .Experts' manual and qualitative evaluation of the implants has closer correlation with the true quality of the reconstructions [30].

D. Task2
To show the utility of our methods on real clinical cases, the implant designs were created for the 11 clinical defective skulls from the Task 2 of the AutoImplant II challenge.As described in Ellis et al. [30], the implant designs were quantitatively compared to reconstructions from postoperative imaging of the actual implant the patients' received.Table III shows these quantitative results from S (50) and SSM (30), as well as from the AutoImplant II submissions.The S (50) and SSM (30) had the best Hausdorff 95 scores than all other submissions but scored worse than some other submissions in the dice similarity and boundary dice similarity scores.However, comparing the implant designs to the reconstructions of the implants from the postoperative CT imaging do not necessarily serve as a reliable metric for the quality of the implant designs [30].For this reason, the implant designs were also qualitatively evaluated by experienced neurosurgeon, MRA.The implant designs were judged based on completeness, false positive area, fit, and overall feasibility as described in Ellis et al. [30].As shown in Table IV, the S (50) implant designs had better overall feasibility, better fit, and less false positive area than the submissions from the Autoimplant II challenge.While none of the submissions from the Autoimplant II challenge were deemed feasible without modifications, 4 out of 11 of the S (50) designs were deemed feasible with only minor flaws.Therefore, the S (50) designs represent a substantial improvement in the clinical feasibility of implant designs.The main issues plaguing the S (50) implant designs were that they did not always extend far enough in the superior direction to fully restore the natural skull shape and that the implants were often too thick.
Table IV: Qualitative evaluation scores for Task 2 of the AutoImplant II challenge by neurosurgeon MRA.The scores have been normalized such that 0 is the lowest possible score and 1 is the highest possible score.Completeness (Comp) evaluates the amount of the defect that the implant design covers.False positive area (FPA) evaluates the amount of amount of implant design outside of the defect area.Fit evaluates the shape of the implant design relative to the defect.Feasibility evaluates whether the implant design could be used in surgery.See Ellis et al. for the qualitative analysis methods [30].IV.DISCUSSION In automatic cranial implant design, deep learning-based approaches that rely on a defect-complete or defect-implant pair for training often fail to generalize to large and complex cranial defects in the test set, since the synthetic defects used during training have different distributions to the real defects during evaluation (i.e., domain shift).One popular solution to this problem is resorting to intensive data augmentation: augment the defects [26,40] and/or augment the skull images [32,38].The former tries to create realistic synthetic defects for training.Data augmentation has shown to be effective to the generalization problem for deep learning in automatic cranial implant design.However, the computational cost is substantially increased.Besides, the patient-specific cranial defects can depict considerable variations among individuals, making it impractical to cover all defect patterns through augmentation.The SSM-based approach can circumvent the defect-related generalization problem, as an SSM relies only on the complete skulls available in the training set for training.Therefore, an SSM is insensitive to and unaffected by the changes in defect patterns in the test cases.One factor affecting the performance of an SSM is the registration accuracy.Since human craniums are structurally uncomplicated and topologically stable among individuals compared to the defects and facial bones, precise registration among different craniums is highly achievable.Therefore, an SSM or simply a shape warping-based approach, though methodologically simple, is ideal for the cranial defect reconstruction task.
Furthermore, due to the black-box nature of deep learning, the shape features learned by a deep neural network are often entangled and lacking interpretability.In contrast, the basic components of an SSM -the mean shape and shape variations, are explicitly expressed and therefore highly comprehensible to humans.In clinical settings, interpretability and transparency adds to the trust in the methods.

V. CONCLUSION
As an alternative to mesh-based SSM, we demonstrate in this paper that an SSM can be built directly on (volumetric) skull images.The image-based SSM of the skull is evaluated on three cranial defect reconstruction tasks, and the results reveal that even if state of the art deep learning-based approaches still prevail in reconstructing synthetic defects, our SSM-based method shows advantages in the reconstruction of large and complex defects that are common in cranioplasty.Besides, the SSM-based methods are not dependent on large quantities of training data as deep learning, making the proposed approach highly scalable and applicable in a clinical setting.

Figure 2 :
Figure 2: A: the input defective skull.B: the mean skull ( S (30)).C: the subtraction between B and A. D: the reference skull x j .E: the subtraction between D and A.
calculating Φ (Equation 4 and 5) from high resolution images is a computationally expensive process, we only used C = 30 (out of 100) complete skulls for experiments involving Φ.The reference skull x j is chosen to be case 001.nrrd in the Task 3 training set and Z j = 222.All the training and test samples are registered to 001.nrrd through a similarity transformation Tr.

Figure 4 :
Figure 4: λ (y axis) computed based on Equation 6 and Equation 11 for different shape variations (x axis refers to the index of variation).The averaged weights of the 110 test cases: 1 110

Figure 5 :
Figure 5: Manually extract the implants from the subtraction results using 3D Slicer.

Table I :
Quantitative results (mean DSC, bDSC and HD95) on the 110 test cases of Task 3.

Table III :
Quantitative results for Task 2 of the AutoImplant II challenge.