Introduction

Before deep learning gained wide popularity [1], the statistical shape model (SSM) and its variants (e.g., active shape models (ASMs) [2], active blobs [3] and active appearance models (AAMs) [4]) were broadly adopted in medical reconstruction and segmentation tasks, such as the reconstruction of craniofacial defects [5,6,7,8,9,10] and the human rib cage [11], as well as the segmentation of hip joints [12] and organs [13]. In contrast to the latent shape features learned by deep neural nets, which are difficult to interpret, an SSM expresses a shape explicitly, by linearly combining the mean shape and the principal modes of shape variation of a given shape pool. Surface meshes were common choices for anatomical shape representation in many SSM-based studies. Establishing dense point correspondence among the meshes is deemed the most demanding part of building an SSM, especially when the medical images from which the meshes are derived are of high resolution [14]. Methods that establish point correspondence automatically are typically based on a mesh-to-mesh registration procedure (e.g., Iterative Closest Point [15]), where the meshes are registered to a reference mesh through a similarity transformation (scaling, rotation and translation). However, popular state-of-the-art segmentation methods and various medical applications are image-based [16,17,18,19,20]. It is therefore desirable to circumvent the image-to-mesh conversion procedure and build an SSM directly on volumetric images [21,22,23]. Automatic cranial implant design is another typical application that uses images as the initial shape representation, i.e., a binary voxel occupancy grid [24]. Existing deep learning-based methods usually train a deep neural net on hundreds of skull images with either synthetic defects [25,26,27,28] or real-world clinical defects [29], which, however, are often not publicly available or not large enough for training deep models.
These approaches are data- and computation-intensive, and most importantly, the reconstruction quality for large and complex defects remains inadequate for clinical use [30, 31]. The failure can largely be attributed to a distribution shift of the defects: the synthetic defects in the training set follow a different distribution from those in the test set. Augmenting the training set intensively has proven to be an effective solution to the distribution adaptation problem [32]. However, the results of current data augmentation-enhanced deep learning methods are still rated as substandard by experienced neurosurgeons [30, 32]. A method that is insensitive to the defects may potentially avoid the distribution shift problem in cranial defect reconstruction. To this end, we propose an SSM-based method for automatic cranial implant design that relies only on complete skulls. Unlike previous mesh-based SSMs for craniofacial defect reconstruction [6, 7, 9], our SSM is built directly on volumetric skull images represented as binary voxel grids. We show that the dense point correspondence for mesh-based SSMs can be approximated through an image registration and warping step among the voxel grids, and that the mean shape and shape variations of the skulls can be calculated thereafter. Besides the expected robustness against large and complex cranial defects, another favorable property of an SSM-based method is that the skull shapes can be expressed explicitly, unlike deep learning-based approaches that often learn an uninterpretable shape representation. The proposed SSM is evaluated on both synthetic defects and irregularly shaped clinical defects from three cranial implant design tasks. We show that even though deep learning still outperforms the SSM on synthetic defect reconstruction, its performance is inferior when it comes to large and complex clinical defects (Fig. 1).

Fig. 1 Teaser: reconstructing a skull with two defects. The skulls are shown in yellow and the implants in green

Method

Overview

The main workload of building an SSM lies in finding the mean shape \(\bar{S}\) and the primary shape variations \(\mathbf {\phi }\) of a given shape pool, as specified by Eq. (1):

$$\begin{aligned} S=\bar{S}+\sum _{i=1}^{C}\lambda _i\mathbf {\phi }_i \end{aligned}$$
(1)

\(\lambda _i\) is the weight of variation \(\mathbf {\phi }_i\). C is the number of shape variations chosen for reconstructing the new shape S. Let \(\textbf{X}\) be a shape pool containing C binary volumetric images \(\textbf{x}_i \in \{0,1\}^{D_i}\), in which the \(j_{th}\) image \(\textbf{x}_j\) (\(0<i, j\leqslant C\)) is selected as a reference. \(D_i\) is the image dimension. The non-zero voxels in these images constitute the geometry of the shapes and are regarded by default as the shape landmarks. Therefore, establishing the point correspondences between all the images in the shape pool can be achieved by simply registering \(\textbf{x}_i\) (\(i\ne j\)) to the reference image \(\textbf{x}_j\):

$$\begin{aligned} \textbf{Tr}: x_i\rightarrow x_j \end{aligned}$$
(2)

\(\textbf{Tr}\) is a transformation that warps the images in \(\textbf{X}\) into the space of \(\textbf{x}_j\): \(x'_i=\textbf{Tr} (x_i)\), \(x'_i \in R^{D_j}\). Let \(X'=\begin{Bmatrix} x'_i \,| \, i\in Z, 0< i\leqslant C \end{Bmatrix}\), \(\mathbf {X'} \in R^{ D_j \times C}\) be the set of warped shapes. The mean shape \(\bar{S} \in R^{D_j}\) of \(\textbf{X}'\) is calculated as:

$$\begin{aligned} \bar{S}=\frac{1}{C}\sum _{i=1}^{C}x'_i \end{aligned}$$
(3)

To extract the shape variations \(\mathbf {\phi }_i \in R^{D_j}\), principal component analysis (PCA) [33] is used. Let \(\mathbf {\Phi }=\begin{Bmatrix}\phi _i\,|\,i\in Z, 0< i\leqslant C\end{Bmatrix}\), \(\mathbf {\Phi } \in R^{C \times D_j}\) (\(C\ll D_j\)) be the set of chosen shape variations. Transforming \(X'\) into the PCA space can be achieved via:

$$\begin{aligned} \mathbf {\Phi }\cdot X'=X_{pca}' \end{aligned}$$
(4)

\(X_{pca}' \in R^{C \times C}\). \(X_{pca}'\) is given by the PCA function from the scikit-learn package and \(X'^{-1}\) is computed from the training set X. Therefore, we can calculate the variation matrix \(\mathbf {\Phi }\) as follows (Footnote 1):

$$\begin{aligned} \mathbf {\Phi } =X_{pca}'\cdot X'^{-1} \end{aligned}$$
(5)

\(X'^{-1}\) is a pseudo inverse of \(X'\). Given a test shape y, it is first registered to the reference image \(y'=\textbf{Tr}(y), y' \in R^{D_j}\) and then mapped into the PCA space defined by the shape pool X:

$$\begin{aligned} \lambda =\mathbf {\Phi } \cdot y' \end{aligned}$$
(6)

\(\lambda =\begin{Bmatrix}\lambda _i \,|\,i\in Z, 0< i\leqslant C\end{Bmatrix}\). We rescale \(\lambda _i\) to [0,1] via: \((\lambda _i -\min (\lambda ))/(\max (\lambda )-\min (\lambda ))\). Given \(\lambda\), \(\mathbf {\Phi }\) and \(\bar{S}\), the new shape can be computed according to Eq. (1). In our work, \(\textbf{Tr}\) is chosen to be a similarity transformation. The reconstructed shapes can be warped to their original space via an inverse transformation \(\textbf{Tr}^{-1}\).
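The pipeline of Eqs. (1), (3) and (6) can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the shape pool is random toy data, the registration step \(\textbf{Tr}\) is assumed to have been applied already, and the shape variations are obtained from a singular value decomposition of the centered pool instead of the scikit-learn PCA and pseudo-inverse route of Eqs. (4) and (5). All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the warped shape pool X': C flattened binary volumes
# with D voxels each (rows are x'_i). Real skull volumes are far larger.
C, D = 5, 64
X_warped = (rng.random((C, D)) > 0.5).astype(float)

# Eq. (3): the mean shape is the voxel-wise average of the warped shapes.
S_bar = X_warped.mean(axis=0)

# Shape variations (assumption): right singular vectors of the centered pool,
# which is one standard way of computing principal components.
_, _, Vt = np.linalg.svd(X_warped - S_bar, full_matrices=False)
Phi = Vt  # shape (C, D), rows are phi_i


def rescale(lam):
    """Min-max rescale the weights to [0, 1]."""
    return (lam - lam.min()) / (lam.max() - lam.min())


# Eq. (6): weights of a (toy) warped test shape y'.
y_warped = (rng.random(D) > 0.5).astype(float)
lam = Phi @ y_warped
lam01 = rescale(lam)

# Eq. (1): the new shape is the mean plus the weighted shape variations.
S = S_bar + lam01 @ Phi
```

The result `S` lives in the reference space; mapping it back with \(\textbf{Tr}^{-1}\) is outside the scope of this sketch.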

Volumetric Shape Completion

Shape Warping

An intuitive way to complete a defective shape y is to warp it to the space of a complete shape \(x_j\), which can be achieved through a registration process (i.e., \(y'=\textbf{Tr}(y)\)). Since the anatomical landmarks of the two shapes are aligned by the registration, a subsequent subtraction operation (Footnote 2) between the two shapes can yield the missing portion of the defective shape \(y_m\):

$$\begin{aligned} y_m = x_j - y' \end{aligned}$$
(7)

The addition of \(y'\) and \(y_m\) produces the complete shape corresponding to \(y'\). By inverting the registration, we can obtain the complete shape \(y_c\) corresponding to y in its original space:

$$\begin{aligned} y_c=\textbf{Tr}^{-1}(y_m + y') \end{aligned}$$
(8)

The concept is similar to that of a template-based shape completion approach [34], in which the missing part of a defective shape is taken from a complete template shape. The choice of the template shape affects the authenticity of \(y_m\). Optimally, a shape \(x_j\) that is general and representative of the shape class should be chosen as the template, to ensure that the registration error between \(x_j\) and \(y'\) is small and the missing part is taken from anatomically close regions on the template. The template shape can be from a single image like \(x_j\), or the mean shape \(\bar{S}\) of a shape pool, as specified in Eq. (3).
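On binary voxel grids, Eq. (7) and the addition inside Eq. (8) can be sketched as follows. The sketch assumes that the subtraction is interpreted as a set difference (voxels present in the template but absent in the warped defective shape) and the addition as a logical OR, so that the results stay binary; this interpretation, the function names and the toy volumes are assumptions for illustration. The inverse transformation \(\textbf{Tr}^{-1}\) is omitted.

```python
import numpy as np


def missing_part(template, warped_defective):
    """Eq. (7) as a set difference: template voxels absent from y'."""
    t = template.astype(bool)
    y = warped_defective.astype(bool)
    return np.logical_and(t, np.logical_not(y))


def complete_warped(warped_defective, y_m):
    """y' + y_m in the reference space, kept binary via a logical OR."""
    return np.logical_or(warped_defective.astype(bool), y_m.astype(bool))


# Toy 3D example: a solid cube as the template x_j and the same cube
# with a block removed as the warped defective shape y'.
x_j = np.zeros((8, 8, 8), dtype=np.uint8)
x_j[2:6, 2:6, 2:6] = 1
y_prime = x_j.copy()
y_prime[2:4, 2:4, 2:6] = 0  # synthetic "defect" of 2 x 2 x 4 voxels

y_m = missing_part(x_j, y_prime)
completed = complete_warped(y_prime, y_m)
```

In this idealized case the template and the defective shape agree everywhere outside the defect, so `y_m` is exactly the removed block. With real skulls the difference also contains registration residue, which is why post-processing is needed.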

SSM for Volumetric Shape Completion

If the shape pool X consists of complete shapes, while the test shape y refers to a defective shape, applying Eqs. (1)–(6) would give the complete counterpart corresponding to y. In this sense, SSM can be used for shape completion tasks. In [35], the authors used PCA for skull shape completion and showed that, by applying PCA and an inverse PCA consecutively to a defective skull, a complete skull can be obtained. Equations (1) and (6) give the mathematical explanation: the PCA computes the skull shape variations \(\mathbf {\Phi }\) from the training samples and the weights \(\lambda\) from the warped defective skull \(y'\), while the inverse PCA, according to the implementation of inverse_transform from the scikit-learn package, computes:

$$\begin{aligned} S=\bar{S} +\lambda \cdot \mathbf {\Phi } \end{aligned}$$
(9)

which is equivalent to Eq. (1). Incorporating Eq. (6) into Eq. (9) we get:

$$\begin{aligned} S=\bar{S} + \mathbf {\Phi }\cdot y'\cdot \mathbf {\Phi }=\bar{S}+y'\cdot \mathbf {\Phi }^T\cdot \mathbf {\Phi } \end{aligned}$$
(10)

In both [35] and Eq. (1), the principal components of a defective skull are used as the weights of the shape variations. An obvious shortcoming is that, if the defects are too large, the principal components computed from a defective skull might not reflect the true distribution of the shape variations of a complete skull. For example, given a defective skull whose facial bone far outweighs the cranium due to a large cranial defect, the weight \(\lambda _{k}\) of the variation concerning the facial area \(\phi _{k}\) (\(0<k\leqslant C\)) would overwhelm the other variations, resulting in an inappropriate reconstruction of the region of interest (ROI, i.e., the cranium) for the cranial implant design task.
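The PCA followed by inverse PCA round trip can be reproduced on toy data with scikit-learn, as a rough sketch. The data, sizes and variable names below are hypothetical. Note that scikit-learn's `PCA.transform` subtracts the training mean before projecting, so the weights it returns are those of the centered shape \(y'-\bar{S}\), which differs slightly from Eq. (6) as written.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy data: C complete shapes (one flattened volume per row) and one test shape.
C, D = 6, 50
X_warped = (rng.random((C, D)) > 0.5).astype(float)
y_warped = (rng.random(D) > 0.5).astype(float)

# After centering, at most C - 1 components carry information.
pca = PCA(n_components=C - 1)
pca.fit(X_warped)

Phi = pca.components_  # shape variations, shape (C - 1, D)
S_bar = pca.mean_      # mean shape, equal to X_warped.mean(axis=0)

# PCA followed by inverse PCA on the test shape.
lam = pca.transform(y_warped[None, :])  # weights (scikit-learn centers first)
S = pca.inverse_transform(lam)[0]       # reconstructed shape

# The same reconstruction written out in the form of Eq. (9).
S_manual = S_bar + ((y_warped - S_bar) @ Phi.T) @ Phi
```

The reconstruction is the projection of the test shape onto the subspace spanned by the shape variations of the complete training shapes, which is why structure absent from the defective input can reappear in `S`.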

Experiment and Results

Dataset and Metrics

We evaluated our method on three datasets: the 11 clinical cases of defective skulls from Task 2 of the AutoImplant II challenge [36] (Footnote 3), the 29 craniotomy skulls from MUG500+ [37], and the 110 test skulls with synthetic defects from Task 3 of AutoImplant II. To conform to the evaluation scheme of the AutoImplant II challenge, we measure the reconstruction accuracy using the Dice similarity coefficient (DSC), border DSC (bDSC) and 95th percentile Hausdorff distance (HD95). The complete skulls from the training set of Task 3 were used as the shape pool X. The image dimension is \(D_i= 512\times 512 \times Z_i\) (Footnote 4; \(Z_i\) differs for different images). As calculating \(\mathbf {\Phi }\) (Eqs. (4) and (5)) from high resolution images is a computationally expensive process, we only used \(C=30\) (out of 100) complete skulls for experiments involving \(\mathbf {\Phi }\). The reference skull \(x_j\) is chosen to be case 001.nrrd in the Task 3 training set and \(Z_j=222\). All the training and test samples are registered to 001.nrrd through a similarity transformation \(\textbf{Tr}\).
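For reference, the DSC between two binary volumes can be computed as in the sketch below. This is a generic implementation of the standard definition and not the challenge's official evaluation code; the function name and the toy volumes are hypothetical, and border DSC and HD95 are not covered.

```python
import numpy as np


def dice(pred, gt):
    """Dice similarity coefficient between two binary volumes."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # convention chosen here: two empty volumes match perfectly
    return 2.0 * np.logical_and(pred, gt).sum() / denom


# Toy example: two slabs of 32 voxels each that overlap in 16 voxels.
a = np.zeros((4, 4, 4), dtype=np.uint8)
b = np.zeros((4, 4, 4), dtype=np.uint8)
a[0:2] = 1
b[1:3] = 1
score = dice(a, b)  # 2 * 16 / (32 + 32) = 0.5
```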

For the synthetic defect reconstruction task (“Reconstruction of Synthetic Cranial Defects” section), four deep learning-based approaches [27, 32, 38, 39] are chosen for a comparison study. For Task 2 (“Task2 of the AutoImplant II Challenge” section), two deep learning-based approaches are used for comparison [31, 32]. The networks used in these approaches are based on an auto-encoder [27, 39] or U-Net [31, 32, 38]. In [32, 38], the authors adopted a registration-based data augmentation scheme that substantially increased the number of training samples, and trained a 3D U-Net on the augmented dataset. Their solutions won first place in the AutoImplant challenge. In [27, 39], the authors used a standard 3D auto-encoder trained on the original challenge dataset without augmentation. In [31], a two-step learning process was employed, wherein one 3D U-Net learns the shape of the implant, while a subsequent 3D U-Net refines the learned implant.

Reconstruction of Synthetic Cranial Defects

The 110 test skulls from Task 3 contain synthetic defects. In this experiment, we evaluate different methods for creating a skull template for shape warping: averaging 30 complete skulls (denoted as \(\bar{S}\) (30)), averaging 50 complete skulls (denoted as \(\bar{S}\) (50)) and using only a single skull (denoted as \(x_j\)). The 30 skulls are a subset of the 50 skulls. The results are compared with two deep learning-based methods from [27, 39] (Footnote 5). In addition, shape reconstruction using only the shape variations is also evaluated: \(\sum _{i=1}^{d_0}\lambda _i\Phi _i\) (\(\lambda _i=1\)) and \(\sum _{i=1}^{d_0}\lambda _i\Phi _i\). For the former, \(\lambda _i\) is set to one. For the latter, \(\lambda _i\) is calculated based on Eq. (6). The DSC, bDSC and HD95 for these shape completion methods are reported in Table 1 and Fig. 2.

Table 1 Quantitative results (mean DSC, bDSC and HD95) on the 110 test cases of Task 3
Fig. 2 Boxplots of DSC, bDSC and HD95. 0: SSM (30), 1: SSM (30) + DL, 2: \(\sum _{i=1}^{d_0}\lambda _i\Phi _i\) (\(\lambda _i=1\)), 3: \(\sum _{i=1}^{d_0}\lambda _i\Phi _i\), 4: \(\bar{S}\) (50), 5: \(\bar{S}\) (30), 6: \(x_j\), 7: DL, 8: [35]

The results bear important implications: (1) Using a single skull image \(x_j\) as the template for shape warping can produce reasonable results, qualitatively and quantitatively (third row, Table 1, and Fig. 3E). However, it should be noted that the conclusion applies only to this specific task, where the region of interest (ROI), i.e., the cranium, is structurally simple. Using a single image as the template would likely fail on more complex facial structures. (2) A shape template derived from a single shape (\(x_j\)) produces results comparable to those from a mean shape averaged from 30 (\(\bar{S}\) (30)) or 50 (\(\bar{S}\) (50)) images. Figure 3 gives an example of the results obtained using shape warping. We can see that \(\bar{S} (30)\) (Fig. 3B) shows no noticeable difference on the cranium compared to \(x_j\) (Fig. 3D). As a result, subtracting the input from the templates (Eq. (7)) produces similar implants. The main difference lies in the facial area and the subtle interior structures (Fig. 3C and E). The reconstruction accuracy depends largely on how well the target matches the template on the ROI (e.g., cranium or facial bones) during the warping and registration process. It is relatively easier to register craniums than facial structures from different subjects. In a facial reconstruction task, using a mean facial shape (e.g., Fig. 3B) would potentially produce a more accurate reconstruction than using a single image. (3) In comparison to the deep learning-based approaches [32, 38], the shape warping- and SSM-based methods achieve quantitatively inferior results on synthetic defects. However, it should be noted that both [38] and [32] used an intensively augmented dataset during training, while only 30 images were used to build the SSM.

Fig. 3 A: the input defective skull. B: the mean skull (\(\bar{S}\) (30)). C: the subtraction between B and A. D: the reference skull \(x_j\). E: the subtraction between D and A

MUG500+

This section presents the implant generation results on the craniotomy skulls from the MUG500+ dataset [37]. Figure 4 shows the 3D Slicer-based manual processing procedure of an implant (Fig. 4(A)) generated by subtracting the input defective skull from the skull reconstructed by SSM (30). First, a median smoothing filter is applied to the subtraction result to partially disconnect the implant from the noise (Fig. 4B). The smoothing kernel size should be chosen individually based on the specific case. Second, the scissors functionality is used to erase the delineated area (Fig. 4C) to fully remove the noise and extract the implant. Figure 4(D) shows the final implant. Steps (B) and (C) are done manually using 3D Slicer (https://www.slicer.org/) [40]. Alternatively, the implant can be extracted automatically by applying morphological opening and connected component analysis consecutively to the subtraction result. However, it is recommended to follow the manual post-processing workflow in Fig. 4 for an optimal outcome. Figure 5 presents the automatically generated implants for some large and complex defects in the MUG500+ dataset. The implants are generated by \(\bar{S}\) (50) and manually post-processed according to Fig. 4. We can see that some of the defects are large, extending across almost half of the cranium and having rather irregular shapes. Nonetheless, the defects are still satisfactorily reconstructed. Notably, Fig. 1 (the teaser image) shows that the method is still effective when multiple large defects exist on the craniotomy skull. The completed skulls obtained according to Eq. (8) preserve the anatomical aesthetics of normal human skulls. The MUG500+ dataset also provides manually designed implants (i.e., surface models in .stl format) for the 29 craniotomy skulls. We convert the manual designs to images (.nrrd) for a quantitative comparison with our SSM-based methods. The results are given in Table 2.
Since our work is the first to use the craniotomy dataset for evaluation, results from deep learning-based approaches are not available. Therefore, we only calculate and report the results from \(\bar{S}\) (50). Note that the quantitative scores should be interpreted with care, as they only reflect how well the reconstructions match with the manual designs rather than their actual clinical feasibility [30].
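The automatic alternative mentioned above (morphological opening followed by connected component analysis) could look roughly like the following SciPy sketch. The function name, the default structuring element, the number of opening iterations and the rule of keeping the largest component are assumptions for illustration, not settings reported in this work.

```python
import numpy as np
from scipy import ndimage


def extract_implant(subtraction, opening_iterations=1):
    """Keep the largest connected component after a morphological opening."""
    mask = subtraction.astype(bool)
    # Opening removes thin structures and isolated voxels (registration noise).
    opened = ndimage.binary_opening(mask, iterations=opening_iterations)
    labels, n_components = ndimage.label(opened)
    if n_components == 0:
        return np.zeros_like(mask)
    # Component sizes, skipping label 0 (background).
    sizes = np.bincount(labels.ravel())[1:]
    largest = 1 + int(np.argmax(sizes))
    return labels == largest


# Toy subtraction result: a 7 x 7 x 7 "implant" block, a one-voxel-thick
# strand attached to it and two isolated noise voxels.
sub = np.zeros((20, 20, 20), dtype=np.uint8)
sub[5:12, 5:12, 5:12] = 1
sub[12:19, 8, 8] = 1
sub[1, 1, 1] = 1
sub[17, 17, 3] = 1

implant = extract_implant(sub)
```

The opening also erodes sharp edges of the implant itself, which is one reason a manual workflow may give a better outcome on real data.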

Fig. 4 Manual extraction of the implants from the subtraction results using 3D Slicer. A The subtraction result. B-D The post-processing results of an implant shown in two different views. The last row shows how the final implant aligns with the skull

Fig. 5 Exemplary results (from \(\bar{S}\) (50)) on the craniotomy skulls from the MUG500+ dataset. The 29 generated implants can be downloaded at https://doi.org/10.6084/m9.figshare.19328816.v3

Table 2 Quantitative results (produced by \(\bar{S}\) (50)) for the MUG500+ craniotomy dataset

Task2 of the AutoImplant II Challenge

We also apply the SSM-based method to the 11 clinical defective skulls from Task 2 of the AutoImplant II challenge. As described by Ellis et al. [30], the implant designs were quantitatively compared to reconstructions from postoperative imaging of the actual implants the patients received for treatment. Table 3 shows these quantitative results from \(\bar{S}\) (50) and SSM (30), as well as from the AutoImplant II submissions. \(\bar{S}\) (50) and SSM (30) achieved better HD95 scores than all other submissions but scored worse than some other submissions in DSC and bDSC. Figure 6 shows a qualitative comparison. The first row in Fig. 6 shows the reconstruction results for a large defect with complex lower borders that extend to the zygomatic bone. We can see that deep learning-based approaches either produce an incomplete reconstruction (Fig. 6A) or fail to cover the complex lower borders (Fig. 6B, C). In contrast, our method shows improved reconstruction in terms of completeness and border consistency (Fig. 6D). An advanced registration method that accurately aligns the corresponding anatomical landmarks between the template and target skull, or a surface extrapolation method that guarantees a smooth surface transition, as presented in [41, 42], is required to further improve border consistency, especially when it comes to complex and irregular defects.

Fig. 6 Qualitative comparison between different implant design methods on Task 2 of AutoImplant II. A [32], B [35], C [31], D Ours. The 11 generated implants can be downloaded at https://doi.org/10.6084/m9.figshare.19328816.v3

Table 3 Quantitative results for Task 2 of the AutoImplant II challenge

However, comparing the implant designs to the reconstructions of the implants from the postoperative CT imaging does not necessarily serve as a reliable metric for the quality of the implant designs [30]. For this reason, the implant designs were also qualitatively evaluated by an experienced neurosurgeon. The implant designs were judged based on completeness, false positive area, fit, and overall feasibility as described by Ellis et al. [30]. As shown in Table 4, the \(\bar{S}\) (50) implant designs had better overall feasibility, better fit, and less false positive area than the submissions from the AutoImplant II challenge. While none of the submissions from the AutoImplant II challenge were deemed feasible without modifications, 4 out of 11 of the \(\bar{S}\) (50) designs were deemed feasible with only minor flaws. Therefore, the \(\bar{S}\) (50) designs represent a substantial improvement in the clinical feasibility of implant designs compared to the deep learning-based challenge submissions. The main issues of the implant designs from \(\bar{S}\) (50) were that they did not always extend far enough in the superior direction to fully restore the natural skull shape and that the implants were often too thick.

Table 4 Qualitative evaluation scores for Task 2 of the AutoImplant II challenge by neurosurgeon MRA. The scores have been normalized such that 0 is the lowest possible score and 1 is the highest possible score. Completeness (Comp) evaluates the amount of the defect that the implant design covers. False positive area (FPA) evaluates the amount of implant design outside of the defect area. Fit evaluates the shape of the implant design relative to the defect. Feasibility evaluates whether the implant design could be used in surgery. See Ellis et al. for the qualitative analysis methods [30]

Discussion

In automatic cranial implant design, deep learning-based approaches that rely on defect-complete or defect-implant pairs for training often fail to generalize to large and complex cranial defects in the test set, since the synthetic defects used during training follow a different distribution from the real clinical defects seen during evaluation. One popular solution to this problem is intensive data augmentation: augmenting the defects [26, 43] and/or augmenting the skull images [32, 38]. While data augmentation has been shown to be effective for the generalization problem, it substantially increases the computational cost. Besides, patient-specific cranial defects show considerable variations among individuals, and it is unlikely that augmentation can cover every possible defect pattern. The SSM-based approach can circumvent the defect-related generalization problem, as an SSM relies only on complete skulls for training. Therefore, an SSM is insensitive to changes in the defect patterns of the test cases. One factor affecting the performance of an SSM is the registration accuracy. Since human craniums are structurally simple and topologically stable among individuals compared to the defects and facial bones, precise registration among different craniums is achievable. Therefore, an SSM or simply a shape warping-based approach is often sufficient for the cranial defect reconstruction task.

Conclusion

As an alternative to mesh-based SSM (i.e., point distribution model), we demonstrate in this paper that an SSM can be built directly on (volumetric) skull images represented as binary voxel grids. We evaluate the SSM-based method on three cranial defect reconstruction tasks, demonstrating the effectiveness and advantages of the method on large and complex cranial defects, compared to learning-based approaches. Besides, unlike deep learning-based approaches, the SSM-based methods do not depend on large quantities of training data, making the proposed method highly scalable and applicable in a clinical setting.

There are two main limitations of the current method that remain to be addressed in future work: (1) When the target and template skulls are not well aligned due to registration errors, non-trivial manual post-processing, as shown in Fig. 4, is required. Poor registration can also cause discontinuities or incompleteness at the junction of the implant and skull surface (e.g., Fig. 6B, C), which undermines the clinical feasibility of the reconstructed implants. Future work could adopt more advanced registration or surface extrapolation methods, as discussed in [41, 42], to further improve the quality of implant reconstruction. (2) Combining the advantages of SSM and deep learning has not been explored in the current work and should be investigated in the future. SSM-based methods are robust to defect variations and can reliably generate acceptable results for large and complex defects, while deep learning-based approaches can leverage large quantities of data to learn more anatomically plausible reconstructions. Integrating deep learning into SSM workflows could potentially improve both aspects.