1 Introduction

In recent years, an increasing number of challenges have been organized at international conferences in medical image analysis and computer vision as more and more imaging data sets become accessible [1,2,3]. These open, public challenges provide an ideal forum for researchers in both academia and industry to participate, with the goal of gaining a better understanding of how various algorithms perform on a specific image analysis task. However, defining a set of metrics to evaluate a particular image analysis algorithm is a non-trivial problem. In many cases, several metrics need to be considered rather than only one. How to combine and balance the different metrics is an important issue, yet the weights assigned to the different metrics are usually chosen to be uniform or based on the organizers' experience. This is not optimal, as a different choice of weights can lead to a different ranking of the participants. From the perspective of a participant, a given rank ordering declares a winner and awards prizes and prestige. However, the spirit of a challenge is to better understand the benefits and drawbacks of the various algorithms rather than to produce a leaderboard ordering. Challenge results may also have farther-reaching effects, for example on commercial product development and method refinement, or by inspiring the design of new algorithms.

In [4], the STAPLE algorithm takes a collection of segmentations as input and estimates a true segmentation as well as a measure of the performance level of each input, which enables assessment of the performance of automatic image segmentation methods. In [5], an evaluation of 14 nonlinear deformation algorithms was conducted using three independent statistical analyses with overlap, volume similarity, and distance measures. A set of measures for the validation of diffusion tensor imaging (DTI) tractography is proposed in [6] and applied to evaluate atlas-building-based tractography. The BRATS challenge was analyzed in [1] to explore why some algorithms worked better than others. The work of [7] proposes ordering metrics by their bias towards a set of defined segmentation properties. While [7] helps inform which metrics to include, it does not give a solution for combining them.

In this paper we combine a data-driven, unsupervised rank aggregation scheme with a perturbation-based analysis of metric sensitivity to automatically compute weights for a set of metrics. Our method does not require normalization of metrics and makes no assumptions about the distribution of metric values. Rather, the estimation of weights and the corresponding rank ordering are determined entirely by the data and the specific image analysis task. We show on real anatomical data that applying the proposed scheme may dramatically change the final rank order, a result we hope will raise awareness in the community about the shortcomings of current ad hoc evaluation methods. Results demonstrate that the iterative procedure yields weights that reflect the contribution of each metric in a plausible way, thus providing improved insight into the overall rank aggregation. Our methodology provides transparency on aggregation that may help future challenge organizers evaluate the best set of metrics beforehand based on existing data. We also advocate for our method as an exploratory tool, as the resulting weight for each metric, each usually representing a different aspect of similarity (overlap, surface distance, sensitivity to outliers, etc.), provides essential information for algorithm assessment. Finally, we reiterate that the motivation of our work is to better understand the performance of various algorithms, not to produce a de facto leaderboard ordering.

2 Methodology

The principle behind this rank aggregation scheme, introduced in [8], is that reliable metrics rank submissions in a similar manner. Metrics that produce orderings that tend to agree with a collection of other metrics will be given a higher weight. Conversely, metrics that produce inconsistent rankings will receive lower weights. Central to the concept of consistent orderings is the stability of metrics. We propose to measure the sensitivity of metrics to small perturbations of the input data. The insight is that metrics that are robust to small changes of the input should receive a higher weight, and metrics that produce different orderings under perturbation should receive lower weights.

2.1 Rank Aggregation

Let \(G = (g_1, ..., g_{N_{g}})\) represent the ground truth data for \(N_{g}\) tasks. For example, these could be segmentations for \(N_g=5\) different cases. A single ground truth could be assembled from several expert sources using label fusion, STAPLE [4], or a custom algorithm [1]. Let \(\mathbf {X} = (X_1, ..., X_{N_{S}})\) be \(N_S\) submissions to the challenge, where each submission \(X_i = (x^i_1, ..., x^i_{N_{g}})\) represents a set of \(N_g\) items to be directly compared to the set of ground truth data \(G\). Let \(\mathbf {M}=(M_1, ..., M_{N_M})\) represent a collection of metrics, where for a given \(g \in G\) and \(x \in X_i\), a metric \(M(g, x)\) returns a scalar value. Let \(\mathbf {R} = (R_1, ..., R_{N_M})\) be the ordinal rankings (ranking functions) corresponding to the metrics \(\mathbf {M}\) evaluated on all submissions. For example, \(R_i\) is the ordinal ordering \((1, 2, ..., N_S)\) of all the submissions under metric function \(M_i\).

We require an aggregate ranking function \(A(\mathbf {R}, G, \mathbf {X}) = \sum _{i=1}^{N_M} w_i R_i(G, \mathbf {X})\), a linear combination of the orderings \(R_i\) (given by metric \(M_i\)), to produce an overall ranking. This linear combination is parameterized by weights \(W = (w_1, ..., w_{N_M})\), which can be thought of as a probability density function (\(\sum _i^{N_M} w_i = 1\)).

For a given submission item x and its corresponding ground truth g, we can compute the average ranking across all metrics defined as

$$\begin{aligned} \mu (g, x) = \frac{\sum _{i=1}^{N_M}R_i(g,x)}{N_M}. \end{aligned}$$
(1)

The mean value \(\mu (g,x)\) can then be used to capture the variance of any individual ranking \(R_i\) under metric \(M_i\) by \(\sigma _i(g,x) = [R_i(g,x) - \mu (g,x)]^2.\) A small value of \(\sigma _i\) suggests that metric \(M_i\) produces an ordering in agreement with the other metrics and should be given a higher weight, while a large value of \(\sigma _i\) represents disagreement with the majority and will receive a lower weight. Note that the mean and variance here are computed with respect to the orderings given by the metrics, not the metric values themselves. That way, no normalization of metrics is required, and no assumptions about the distribution of metric values are made.

We can then pose this as an optimization problem, to find weights which minimize \(\sigma _i\) over all submissions:

$$\begin{aligned} \underset{W}{\text {argmin}} \sum _{g \in G} \sum _{x \in \mathbf {X}} \sum _{i=1}^{N_M} w_i \sigma _i(g,x) \end{aligned}$$
(2)

with the constraint that \(\sum _i^{N_M} w_i = 1\) and \(0 \le w_i \le 1\). The gradient with respect to a given \(w_i\) is

$$\begin{aligned} \nabla _{w_i} = \sum _{g \in G} \sum _{x \in \mathbf {X}} [R_i(g,x) - \mu (g,x)]^2, \end{aligned}$$
(3)

which can be used to derive a gradient descent scheme [8], summarized in Algorithm 1.

Algorithm 1.
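A minimal sketch of this weight-estimation scheme is given below (not the paper's Algorithm 1 verbatim). It assumes a rank matrix `R` of shape \((N_M, N_{\text{cases}})\) whose entry \((i, j)\) is the ordinal rank assigned by metric \(M_i\) to the \(j\)-th submission/ground-truth pair; the simplex projection, step size, and iteration count are illustrative choices not specified in the paper.

```python
# Sketch of projected gradient descent on Eq. (2) subject to the simplex
# constraint on the weights. Step size, iteration count, and the projection
# method are illustrative assumptions, not values from the paper.
import numpy as np

def project_to_simplex(w):
    """Euclidean projection onto {w : w_i >= 0, sum_i w_i = 1}."""
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(w)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(w + theta, 0.0)

def estimate_weights(R, n_iter=200, step=1e-3):
    """R[i, j]: ordinal rank given by metric i to case j (submission/ground-truth pair)."""
    n_metrics = R.shape[0]
    w = np.full(n_metrics, 1.0 / n_metrics)      # start from uniform weights
    for _ in range(n_iter):
        mu = R.mean(axis=0)                      # Eq. (1): mean rank per case
        sigma = (R - mu) ** 2                    # disagreement of each metric per case
        grad = sigma.sum(axis=1)                 # Eq. (3) summed over all cases
        w = project_to_simplex(w - step * grad)  # constrained gradient step
    return w
```

With `R` built from the actual metric evaluations, the returned `w` plays the role of the weights \(W\) in the aggregate ranking \(A\).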

Example: Consider a synthetic example where the true ranking is {1, 2, ..., 15}. We have 10 total ranking functions of varying accuracy, summarized in Table 1. The first three ranking functions give the correct ordering. The next three ranking functions return a list close to the correct ordering, created with 5 random swaps of adjacent items. The next two ranking functions are unreliable, with 5 random swaps of any two items in the ordering. The final two ranking functions are purely random orderings.

The method detects that the first three ranking functions are the most consistent and assigns each of them the highest weight of 0.176. The next three ranking functions are less consistent and receive slightly lower weights of 0.168, 0.146, and 0.146. The next two ranking functions are inconsistent and are given weights of 0.002 and 0.008. The final two random ranking functions are correctly given weights of 0. With the estimated weights, the correct ordering {1, 2, ..., 15} is produced, whereas the ranking with uniform weights is the incorrect ordering {3, 2, 1, 4, 5, 7, 8, 6, 10, 9, 12, 11, 13, 15, 14}.

Table 1. Synthetic example where true ranking is {1, 2, ..., 15}. Ranking functions 1,2,3 are perfect, 4,5,6 contain small errors, 7,8 contain large errors, and 9 and 10 are random.
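The synthetic setup can be reproduced along the following lines, reusing the hypothetical `estimate_weights` from the sketch above. The exact random swaps used in the paper are not available, so the resulting weights will differ in detail, but the perfect ranking functions should receive the largest weights and the random ones weights near zero.

```python
# Hedged reconstruction of the synthetic example: 3 perfect rankers,
# 3 rankers with adjacent swaps, 2 with arbitrary swaps, 2 purely random.
import numpy as np

rng = np.random.default_rng(42)
true_rank = np.arange(1, 16)

def swap_adjacent(r, n_swaps, rng):
    r = r.copy()
    for _ in range(n_swaps):
        j = rng.integers(0, len(r) - 1)
        r[j], r[j + 1] = r[j + 1], r[j]
    return r

def swap_any(r, n_swaps, rng):
    r = r.copy()
    for _ in range(n_swaps):
        j, k = rng.choice(len(r), size=2, replace=False)
        r[j], r[k] = r[k], r[j]
    return r

R = np.array([true_rank] * 3
             + [swap_adjacent(true_rank, 5, rng) for _ in range(3)]
             + [swap_any(true_rank, 5, rng) for _ in range(2)]
             + [rng.permutation(true_rank) for _ in range(2)])

w = estimate_weights(R)          # sketch from Sect. 2.1
aggregate = w @ R                # weighted aggregate ranking scores
print(np.round(w, 3))
print(aggregate.argsort().argsort() + 1)   # final ordinal ordering
```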

Correlation: Inherent to the task of evaluating segmentation is the problem of metric selection and correlation. Indeed, similarity metrics are often highly correlated, or nearly identical in the case of Dice and Cohen’s kappa. As our method favors metrics which are in agreement with other metrics, we must address metric correlation. Guided by the work of [9], we carefully choose a collection of metrics to capture a wide range of metric properties while limiting the use of highly or perfectly correlated metrics. We include the overlap-based metrics Dice, global consistency error, sensitivity, and specificity; the surface-based metric Hausdorff distance (95th percentile); the information-theoretic measure mutual information; and the volume-based measure volumetric similarity.

2.2 Assessing Stability with Perturbations

The traditional domains for rank aggregation, such as elections or meta-search, deal with ordinal rankings from the outset. That is to say, there is a collection of rankings provided by different ranking functions, but the inner workings of the ranking functions are either not available or not defined. In the meta-search example, the ranking functions are often proprietary, and for elections the ranking functions are based on personal preference. In these situations, the only recourse is to deal with the rankings directly.

In this work, we have the unique opportunity to systematically explore the ranking functions themselves. We propose to do this by assessing the stability and robustness of the metrics under small perturbations of the input data. The intuition is that metrics that are robust to small perturbations provide more consistent rank orderings and should receive higher weights. Conversely, a metric where a small change in the input data leads to a large change in the resulting ordering should be considered too sensitive to reliably discriminate differences, and should receive a lower weight. Combining perturbations with rank aggregation allows the metric weights to reflect the sensitivity of the metrics on the specific image analysis task, determined completely by the data.

The method works by iteratively applying perturbations to the ground truth data and re-estimating the weights using the scheme in Sect. 2.1, while keeping a running average of the estimated weights. The necessary component is a method (or methods) to perturb the ground truth data. A perturbation method could apply general-purpose deformations such as rigid transformations, or could be a custom algorithm designed with expert knowledge to accurately mimic anatomical variability. Whatever method is used to make small modifications to the ground truth, the key is to produce a number of unique perturbations to fully probe each metric for reliability.

Algorithm 2.
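A sketch of this perturbation loop is given below, reusing the hypothetical `estimate_weights` from the Sect. 2.1 sketch. It assumes a user-supplied `perturb(g, rng)` function (example perturbations are sketched in Sect. 3) and metrics oriented so that larger values are better (distance-based metrics would be negated first); neither detail is prescribed by the paper.

```python
# Sketch: re-estimate weights under repeated perturbations of the ground truth
# and keep a running average. `metrics` are callables M(g, x) with larger = better;
# `perturb(g, rng)` returns a modified copy of a ground truth item.
import numpy as np

def rank_matrix(ground_truth, submissions, metrics):
    """ranks[i, j*N_S + s] = rank (1 = best) of submission s on case j under metric i."""
    n_m, n_g, n_s = len(metrics), len(ground_truth), len(submissions)
    ranks = np.zeros((n_m, n_g * n_s))
    for i, m in enumerate(metrics):
        for j, g in enumerate(ground_truth):
            scores = np.array([m(g, sub[j]) for sub in submissions])
            ranks[i, j * n_s:(j + 1) * n_s] = (-scores).argsort().argsort() + 1
    return ranks

def aggregate_with_perturbations(ground_truth, submissions, metrics, perturb,
                                 n_iter=100, seed=0):
    """Running average of weights estimated on perturbed ground truth."""
    rng = np.random.default_rng(seed)
    w_sum = np.zeros(len(metrics))
    for _ in range(n_iter):
        perturbed = [perturb(g, rng) for g in ground_truth]
        w_sum += estimate_weights(rank_matrix(perturbed, submissions, metrics))
    return w_sum / n_iter
```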

3 Experimental Validation

Data: We test our rank aggregation scheme on an artificial challenge to segment the corpus callosum, a flat bundle of fibers that connects the left and right hemispheres of the brain. The 2D contour of the corpus callosum is clearly visible in mid-sagittal slices of 3D brain MRI. Our data consist of 10 unique subjects (2D sagittal slices) that are each repeated 3 times to form a dataset of 30 images, where the image ordering is randomly permuted. Submitters were asked to manually outline the 30 corpus callosum structures using \(\texttt {itksnap}\) [10], without knowledge that the set comprised 10 subjects repeated 3 times each. In total, 6 submitters provided outlines, which can be considered 18 unique submissions when taking into account the repeated nature of the data. For evaluation, ground truth segmentations were obtained by a deformable active contour model [11]. An example corpus callosum segmentation is shown in Fig. 1.

Fig. 1. Corpus callosum shown in red.

Metrics: To evaluate each submission with respect to the ground truth, we employ the metrics discussed in Sect. 2: Dice, global consistency error, sensitivity, and specificity; Hausdorff distance (95th percentile); mutual information; and volumetric similarity. The metrics were chosen to capture a wide range of metric properties while limiting the use of highly or perfectly correlated metrics, as shown in [9].
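As an illustration, two of the seven metrics can be computed for binary masks as follows. This is one common formulation; the exact implementation and the pixel spacing used in the paper are not specified, so both are assumptions here.

```python
# One common way to compute Dice overlap and the 95th-percentile Hausdorff
# distance for binary masks (spacing is an assumed parameter).
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice(gt, seg):
    gt, seg = gt.astype(bool), seg.astype(bool)
    return 2.0 * np.logical_and(gt, seg).sum() / (gt.sum() + seg.sum())

def _surface(mask):
    mask = mask.astype(bool)
    return mask & ~binary_erosion(mask)          # boundary pixels of the mask

def hausdorff95(gt, seg, spacing=(1.0, 1.0)):
    """95th percentile of the symmetric surface distances between two masks."""
    s_gt, s_seg = _surface(gt), _surface(seg)
    d_to_gt = distance_transform_edt(~s_gt, sampling=spacing)    # distance to gt surface
    d_to_seg = distance_transform_edt(~s_seg, sampling=spacing)  # distance to seg surface
    d = np.concatenate([d_to_seg[s_gt], d_to_gt[s_seg]])
    return np.percentile(d, 95)
```

Note that Hausdorff distance is a "lower is better" metric; for use with the ranking sketch of Sect. 2.2 it would be negated (or ranked in ascending order).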

Perturbations: We implement 3 perturbation methods. To model linear transformations, we use rigid perturbations with a specified amount of random translation, rotation, and scaling. To model submissions that might over- or under-segment, we use morphological perturbations which randomly alternate between dilation and erosion. Finally, to model nonlinear differences from the ground truth, we use B-spline perturbations with randomness controlled through Gaussian-distributed random sampling of the B-spline parameters. At each iteration, a random perturbation method is chosen; a minimal sketch of the first two types is shown below.
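The sketch below illustrates the first two perturbation types with simple scipy operations; the scaling component, the B-spline deformation, and the exact parameter values are omitted for brevity and are not taken from the paper.

```python
# Sketch of two of the three perturbation types using scipy.ndimage.
import numpy as np
from scipy import ndimage

def rigid_perturb(mask, rng, max_shift=2.0, max_rot=5.0):
    """Random in-plane rotation (degrees) and translation (pixels) of a binary mask."""
    out = ndimage.rotate(mask.astype(float), rng.uniform(-max_rot, max_rot),
                         reshape=False, order=1)
    out = ndimage.shift(out, rng.uniform(-max_shift, max_shift, size=mask.ndim),
                        order=1)
    return out > 0.5                              # re-binarize after interpolation

def morph_perturb(mask, rng, n_iter=5):
    """Randomly alternate dilation and erosion to mimic over/under-segmentation."""
    out = mask.astype(bool)
    for _ in range(n_iter):
        if rng.random() < 0.5:
            out = ndimage.binary_dilation(out)
        else:
            out = ndimage.binary_erosion(out)
    return out

def perturb(mask, rng):
    """Choose one perturbation method at random per iteration."""
    return [rigid_perturb, morph_perturb][rng.integers(2)](mask, rng)
```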

Table 2. For the corpus callosum challenge, the overall ranking using uniform weights compared to the ranking under the weights estimated by rank aggregation with perturbations.

Results: We explore our rank-aggregation-with-perturbations framework by considering the 18 submissions of the 10 corpus callosum segmentation tasks. The submissions are named A–F, with a suffix denoting the 3 repeated segmentation tasks. The left side of Table 2 shows the final overall ranking using naive uniform weights for each metric. Our proposed method estimates the weights: Dice = 0.33, mutual information = 0.21, specificity = 0.20, volumetric similarity = 0.12, Hausdorff distance = 0.10, sensitivity = 0.03, and global consistency error = 0.01. The overall ranking under the computed weights is shown on the right of Table 2. It is interesting to note that the estimated weights dramatically changed the overall order as compared to uniform weights. In this case, global consistency error and sensitivity produce inconsistent orderings under perturbation and receive a low weight. The distribution of weights has the potential to provide important insight into why certain algorithms perform well on a given medical imaging task, which is the true spirit of grand challenges. Such feedback may serve to inform algorithm refinement, or help steer new algorithm development. For example, we may gain insight that a particular problem is better solved by a method based on intensities, contrast, shape models, or physical models.

Fig. 2. Evolution of weights vs. the number of iterations of perturbations.

We also explore how the number and magnitude of perturbations influence the final estimated weights. For “small” perturbations, we set the rigid parameters to 2 pixels of translation, 5\(^{\circ }\) rotation, and \(5\%\) scaling, the morphology parameter to 5 iterations, and the B-spline variance to 2.0. For “large” perturbations, we set the rigid parameters to up to 10 pixels of translation, 30\(^{\circ }\) rotation, and \(50\%\) scaling, the morphology parameter to 10 iterations, and the B-spline variance to 15.0. Figure 2 summarizes the results of these experiments. For this experiment, large perturbations seem to provide more separation between metric weights, particularly increasing the weight of specificity and the relative importance of Dice. We also observe faster convergence to stable weights under small perturbations, as large perturbations introduce more variability in the orderings.

4 Conclusion

We have presented a method to automatically calculate weights for a set of metrics, which probes the sensitivity of the metrics by exploring changes in rank due to perturbations of the input data. Our method is completely data-driven, requiring no metric normalization procedures. We showed how our estimated weights can result in a vastly different ordering compared to uniform weighting. This has the potential to better inform organizers about the results and to provide additional insight into the performance of competing algorithms. For example, the distribution of weights and corresponding ranking changes may provide a clue that a particular problem is better solved by a method based on intensities, contrast, shape models, or physical models. Correlation is currently handled by careful selection of metrics. What remains is to automatically select the best metrics in addition to their weights, perhaps by integrating the work of [7]. Future work will explore and validate our method on data from a public challenge.