Section “Ground-truth FLE” defines a probabilistic viewpoint on measurement processes and ground-truth-based FLE. Sampling strategies for crowd and single-person-based experimental FLE with and without fiducial orientation dependence are defined. Section “FLE estimation without ground-truth data” defines the “difference-to-mean” (dtm) estimator which does not use the ground-truth data. Section “Testing equality of the gt and dtm estimators” utilizes a distribution-free kernel-based two-sample hypothesis test [1] to check for significant statistical differences between different ground-truth-based reference estimators and their dtm counterparts. The specific measurement process for the experiment is defined in sections “Virtual phantom” and “Data collection”, where the generation of the phantom (virtual CT dataset) and data collection are explained.
Ground-truth FLE
In this section, we define several alternative interpretations of the FLE distribution for the case when ground-truth fiducial locations are available. These estimators are assumed to be the best possible estimators of the underlying fiducial localization error distribution. Several parameters (CT resolution, imaging energy levels, postprocessing and reconstruction filters, the fiducial material, size and geometry, etc.) determine the final information content of the dataset in which the localization is made. Other parameters (such as the number of repeated localizations, the fiducial markup software used, and the screen resolution) are specific to the procedure with which the data collection is executed. All constant parameters of the imaging process and the measurement methodology are assumed to be implicitly encapsulated in a measurement process \(\mathcal {M}\): these process-specific parameters are not modeled directly and are treated as constants in our investigation. The only explicitly modeled parameters of \(\mathcal {M}\) are the fiducial set (number, location and orientation of fiducials) and the persons performing the measurements. The following variants of the ground-truth FLE measurement methodology were differentiated:
In the generic case, a sample \(f \in \mathbb {R}^3\) is generated by a measurement process \(\mathcal {M}\) with a randomly chosen person p on a randomly chosen fiducial s at repetition r. The values of p and s run over all possible persons and fiducials, respectively: \( f = \mathcal {M}\left( p, s, r \right) \).
A sample f drawn from \(\mathcal {M}\) with uniform selection of s and p is assumed to follow the probability density function \(P_{\mathcal {M}}\):
$$\begin{aligned} f \sim P_{\mathcal {M}}(\cdot \vert p,s). \end{aligned}$$
(1)
Given the true positions of the fiducials \(\mathcal {G}_\mathcal {M} = \left\{ g_1, \dots , g_n \right\} , g_i \in \mathbb {R}^3 \) used in \(\mathcal {M}\), the ground-truth FLE estimator, a 3D vector-valued function of a sample f for fiducial \(k \in \left\{ 1 \dots n \right\} \), is defined by
$$\begin{aligned} \widehat{{\hbox {FLE}}_{\mathrm{gt},k}}(f) := f - g_k. \end{aligned}$$
(2)
Equation (2) maps the 3D point sample f to the error vector pointing from the true position of fiducial k to the acquired sample f. The probability distribution (1) also induces a probability distribution on the ground-truth FLE vectors \(\widehat{{\hbox {FLE}}_{\mathrm{gt},k}}(f)\), where the samples f are drawn from \(P_{\mathcal {M}}\) conditioned on fiducial k:
$$\begin{aligned} P_{{\hbox {FLE}}_{\mathrm{gt},k}} = P \left( \widehat{{\hbox {FLE}}_{\mathrm{gt},k}}(f) \vert f \sim P_{\mathcal {M}}(\cdot \vert p, s = k) \right) . \end{aligned}$$
\(P_{{\hbox {FLE}}_{\mathrm{gt},k}}\) is therefore the distribution of the errors that occur when the person is randomly chosen for each sample measuring the same fiducial, i.e., the conditional probability of the error vector given the fiducial k. For any given k, \({\hbox {FLE}}_{\mathrm{gt},k}\) defines a distribution of (relative) error vectors. Conditioning on s can be interpreted as conditioning on a specific fiducial orientation; to ensure that this holds, every fiducial in the test dataset was given a unique orientation. Therefore, \(P_{{\hbox {FLE}}_{\mathrm{gt},k}}\) is the orientation-dependent version of the ground-truth FLE distribution. Assuming that the samples contain enough different orientations, marginalizing over k gives the fiducial orientation-independent ground-truth FLE distribution
$$\begin{aligned} P_{\mathrm{FLE}_\mathrm{gt}} \approx \frac{1}{n} \sum _{k = 1}^{n} P_{\mathrm{FLE}_{\mathrm{gt},k}}. \end{aligned}$$
(3)
\(P_{{\hbox {FLE}}_\mathrm{gt}}\) is the most generic FLE distribution since it depends neither on person nor on orientation. Person-independent formulations will be referred to as “crowd-based” as they need samples from several individuals.
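As an illustration of Eqs. (2) and (3), the following minimal NumPy sketch computes ground-truth error vectors and pools them over fiducials. The array layout (persons, fiducials, repetitions, xyz), the function names and the synthetic data are assumptions made for this example only; they are not taken from the study's software. With an equal number of samples per fiducial, pooling all error vectors corresponds to the equal-weight mixture in (3).

```python
import numpy as np

def gt_fle(samples, ground_truth):
    """Eq. (2): error vectors f - g_k for every localization.

    samples:      (n_persons, n_fiducials, n_repetitions, 3) localized points
    ground_truth: (n_fiducials, 3) true fiducial positions g_k
    """
    # Broadcast g_k over the person and repetition axes.
    return samples - ground_truth[None, :, None, :]

def pool_over_fiducials(errors):
    """Eq. (3): treat all error vectors as one orientation-independent sample set."""
    return errors.reshape(-1, 3)

# Synthetic stand-in data (hypothetical, NOT the study's measurements).
rng = np.random.default_rng(0)
g = rng.uniform(-50.0, 50.0, size=(9, 3))
samples = g[None, :, None, :] + rng.normal(scale=0.3, size=(10, 9, 10, 3))

errors = gt_fle(samples, g)
errors_k0 = errors[:, 0].reshape(-1, 3)   # orientation-dependent set for fiducial index 0
errors_all = pool_over_fiducials(errors)  # orientation-independent (crowd-based) set
```

Each row of `errors_k0` or `errors_all` is one sample from the corresponding empirical FLE distribution.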
Since only a finite number of samples can be drawn from the underlying \(P_{{\hbox {FLE}}_\mathrm{gt}}\) distribution, it cannot be determined exactly. It can, however, be approximated with measurements by having a multitude of participants repeatedly localize all fiducials in a test set containing multiple fiducials with different orientations.
Conditioning \(P_{{\hbox {FLE}}_{\mathrm{gt},k}}\) on a person p leads to an orientation-dependent and person-specific FLE estimator (\(P_{\mathrm{FLE}_{\mathrm{gt},k,p}}\)). Conditioning (3) on person p results in the orientation-independent person-specific FLE, \(P_{{\hbox {FLE}}_{\mathrm{gt},p}}\). The estimated distributions resulting from these estimators are the best possible estimations of the underlying error distributions that we can achieve with finite sampling; therefore, they will be used as reference estimations of \({\hbox {FLE}}_\mathrm{image}\).
FLE estimation without ground-truth data
This section extends FLE estimation to the practical case when ground-truth fiducial locations are not available in the image dataset. This is the typical case for clinical datasets. The simplest approach [11] is to assume that the measurement process has no bias (the statistical expectation \(\mathcal {E} \left( {\hbox {FLE}}_\mathrm{gt}(f) \right) = 0\)).
Replacing \(g_k\) (the ground-truth knowledge on fiducial k) in (2) with the mean of the samples measuring fiducial k (\(\overline{f}_k\)) gives the dtm estimate of the FLE error vector:
$$\begin{aligned} \widehat{{\hbox {FLE}}_{\mathrm{dtm},k}}(f) := f - \overline{f}_{k}. \end{aligned}$$
(4)
With the help of (4), the estimators of all FLE distributions of section “Ground-truth FLE” can be constructed simply by replacing \(\widehat{{\hbox {FLE}}_{\mathrm{gt},k}}\) with \(\widehat{{\hbox {FLE}}_{\mathrm{dtm},k}}\) in the equations. These estimators are practical because they do not depend on the ground-truth locations and can therefore be used on any real dataset.
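A minimal NumPy sketch of Eq. (4) follows. It assumes the same hypothetical array layout (persons, fiducials, repetitions, xyz) and takes \(\overline{f}_k\) as the mean over all persons and repetitions of fiducial k; for person-specific variants the mean could instead be taken over one person's samples, which the text does not specify. Since \(f - \overline{f}_k = (f - g_k) - (\overline{f}_k - g_k)\), the dtm error vectors equal the ground-truth error vectors shifted by the per-fiducial mean error, so any systematic bias is removed by construction. The synthetic data and the bias value are illustrative only.

```python
import numpy as np

def dtm_fle(samples):
    """Eq. (4): difference-to-mean error vectors f - mean(f_k).

    samples: (n_persons, n_fiducials, n_repetitions, 3) localized points.
    The mean is taken per fiducial over persons and repetitions (assumption).
    """
    mean_per_fiducial = samples.mean(axis=(0, 2), keepdims=True)
    return samples - mean_per_fiducial

# Synthetic stand-in data with a hypothetical systematic offset (not study data).
rng = np.random.default_rng(1)
g = rng.uniform(-50.0, 50.0, size=(9, 3))
bias = np.array([0.2, 0.0, -0.1])
samples = g[None, :, None, :] + bias + rng.normal(scale=0.3, size=(10, 9, 10, 3))

dtm = dtm_fle(samples)
gt = samples - g[None, :, None, :]             # Eq. (2), for comparison
offset = gt.mean(axis=(0, 2), keepdims=True)   # per-fiducial mean error, mean(f_k) - g_k
```

In this sketch `dtm` equals `gt - offset`, which makes explicit that the two estimators can only agree in distribution when the per-fiducial mean error is close to zero.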
Testing equality of the gt and dtm estimators
The dtm estimator is only useful if the estimated distribution closely captures the underlying distribution (i.e., the “real” \({\hbox {FLE}}_\mathrm{image}\) distribution). Since this underlying distribution is unknown, the best we can hope for is that no statistically significant difference can be found between the ground-truth-based reference distribution (which is the best available unbiased estimation of the underlying distribution \({\hbox {FLE}}_\mathrm{image}\) given the samples) and the dtm-based estimation. Since the form of these distributions is unknown, a distribution-free two-sample test is needed that requires no ordering of the samples and is sensitive to differences in all moments. Such a test for the equality of distributions, based on the maximum mean discrepancy metric, is provided by [1]. The MATLAB code for this test is available from the authors at http://people.kyb.tuebingen.mpg.de/arthur/mmd.htm. For the tests, the usual \(\alpha = 0.05\) significance level was chosen. The dtm estimator is considered unreliable if the test rejects the null hypothesis that \({\hbox {FLE}}_{\mathrm{dtm}(,*)} = {\hbox {FLE}}_{\mathrm{gt}(,*)}\). Although the lack of evidence for rejection does not prove that the two distributions are equal, it is a clear indicator that they are very similar.
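To make the idea of a kernel two-sample test concrete, the sketch below implements a biased maximum mean discrepancy statistic with a Gaussian kernel and a permutation-based p-value. It is not the MATLAB code referenced above and may differ from it in the kernel bandwidth and in how the null distribution is approximated; the median-heuristic bandwidth, the permutation count, the function name and the synthetic data are all assumptions of this example.

```python
import numpy as np

def mmd_permutation_test(x, y, n_perm=200, seed=0):
    """Biased MMD^2 with a Gaussian kernel and a permutation p-value.

    x, y: arrays of shape (m, 3) and (n, 3) holding error vectors.
    Returns (observed MMD^2, p-value).
    """
    rng = np.random.default_rng(seed)
    z = np.vstack([x, y])
    m, n_total = len(x), len(x) + len(y)

    # Pairwise squared distances of the combined sample.
    sq = np.sum(z * z, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (z @ z.T), 0.0)

    # Median heuristic for the kernel width (assumption, not from the paper).
    median_d2 = np.median(d2[np.triu_indices(n_total, k=1)])
    gram = np.exp(-d2 / (median_d2 if median_d2 > 0 else 1.0))

    def mmd2(idx):
        ix, iy = idx[:m], idx[m:]
        return (gram[np.ix_(ix, ix)].mean()
                + gram[np.ix_(iy, iy)].mean()
                - 2.0 * gram[np.ix_(ix, iy)].mean())

    observed = mmd2(np.arange(n_total))
    # Null distribution: recompute the statistic under random relabeling.
    null = np.array([mmd2(rng.permutation(n_total)) for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value

# Synthetic illustration (not study data): same distribution vs. shifted mean.
rng = np.random.default_rng(42)
a = rng.normal(size=(100, 3))
b = rng.normal(size=(100, 3))
c = rng.normal(loc=2.0, size=(100, 3))
stat_same, p_same = mmd_permutation_test(a, b)
stat_shift, p_shift = mmd_permutation_test(a, c)
```

With \(\alpha = 0.05\), the null hypothesis of equal distributions would be rejected when the returned p-value falls below 0.05. The permutation null assumes that the two sample sets are exchangeable under the null hypothesis.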
Virtual phantom
In order to collect the samples required to estimate both ground-truth FLE and dtm FLE, a virtual phantom was created. A micro-CT scan of a titanium screw (1 mm \(\times \) 3 mm) was used to represent the fiducial geometry. It was scanned in high resolution using a Scanco vivaCT 40 \(\upmu \)CT device (Scanco Medical AG, Switzerland) at 70 kV, with an image matrix of 2048 \(\times \) 2048 pixels and 1000 projections using an isotropic 10.5 \(\upmu \)m voxel size. The isosurface was thresholded to titanium; the segmentation and mesh generation were done in 3D Slicer [12]. The origin of the mesh was placed at the desired target position where the tracker probe tip is expected to touch the fiducial (Fig. 1). The resulting STL mesh was oriented and positioned at nine different locations in a Blender (www.blender.org) scene. The orientations were chosen randomly, but similar to those used in earlier plastic skull phantom experiments (Fig. 2) [11]. The virtual phantom contained only the virtual screws at these random orientations and positions, and their density was set to match titanium (Fig. 3, right panel).
The CONRAD framework [10] was used to simulate a CT scan of the phantom. The resulting image set was converted to DICOM format (0.4 \(\times \) 0.4 \(\times \) 1 \({\hbox {mm}}^3\) voxel size) and was imported into 3D Slicer. The screws were annotated and segmented. The resulting scene was used with 3D Slicer throughout the study. The ground-truth positions and orientations of the screws were calculated and saved as reference values. Samples from the virtual CT dataset and the virtual scene are shown in Fig. 3.
Table 1 Number of samples used for the different types of estimators
Data collection
A group of ten individuals (experts and nonexperts) participated in data collection. The experiment consisted of ten repetitions. At each repetition, the participants were asked to localize nine fiducials in the image set. In order to minimize the effect of short-term memory (repeating the same pixel measurement instead of fully reprocessing the dataset), the participants were encouraged to take a break of a few hours between repeated sessions. The participants applied the same methodology used on real CT datasets and had no access to the ground-truth locations. Each individual localized each of the 9 screws 10 times, giving 90 samples per person; in total, 900 samples were obtained.
Table 2 The first two moments of the ground-truth crowd-based orientation-independent FLE (\({\hbox {FLE}}_\mathrm{gt}\)) and its dtm estimator (\({\hbox {FLE}}_\mathrm{dtm}\))
The definitions of ground-truth and dtm FLE were evaluated, resulting in a set of 3D vectors for each different FLE estimator. Table 1 shows the number of samples used in estimating the various FLE types.
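The sample counts available to each estimator variant follow from the study design of 10 persons, 9 fiducials and 10 repetitions. The sketch below derives them by slicing a placeholder array; the array layout and the dictionary labels are assumptions of this example, and the counts are computed from the design rather than copied from Table 1.

```python
import numpy as np

n_persons, n_fiducials, n_repetitions = 10, 9, 10

# Placeholder error-vector array: (person, fiducial, repetition, xyz).
errors = np.zeros((n_persons, n_fiducials, n_repetitions, 3))

subset_sizes = {
    # All persons, all fiducials.
    "crowd-based, orientation-independent": errors.reshape(-1, 3).shape[0],
    # All persons, one fiducial (one orientation).
    "crowd-based, orientation-dependent": errors[:, 0].reshape(-1, 3).shape[0],
    # One person, all fiducials.
    "person-specific, orientation-independent": errors[0].reshape(-1, 3).shape[0],
    # One person, one fiducial.
    "person-specific, orientation-dependent": errors[0, 0].shape[0],
}

for name, size in subset_sizes.items():
    print(f"{name}: {size} samples")
```

The person-specific, orientation-dependent case has only 10 samples per subset, which limits the power of any two-sample test applied to it.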