Introduction

In the last decade, research on artificial intelligence has grown rapidly through deep learning models, allowing various computer vision tasks to be successfully automated with accurate neural network classifiers [1]. However, evaluation procedures and the quality of reported model performance differ considerably in computer vision between research fields and applications.

The subfield of medical image segmentation (MIS) covers the automated identification and annotation of medical regions of interest (ROIs) such as organs or medical abnormalities (e.g., cancer or lesions) [2]. Various recent studies demonstrated that deep learning based MIS models have powerful prediction capabilities and achieve performance comparable to radiologists [1, 2]. Clinicians, especially in radiology and pathology, strive to integrate deep learning based MIS methods as clinical decision support (CDS) systems into their clinical routine to aid in diagnosis, treatment, risk assessment, and the reduction of time-consuming inspection processes [1, 2]. Given their direct impact on diagnosis and treatment decisions, correct and robust evaluation of MIS algorithms is crucial.

However, in the past years, scientific publishing of MIS studies has revealed a strong trend of highlighting or cherry-picking improper metrics to show particularly high scores close to 100% [3,4,5,6,7]. Studies showed that statistical bias in evaluation is caused by issues ranging from incorrect metric implementation or usage to missing hold-out set sampling for reliable validation [3,4,5,6,7,8,9,10,11]. This has led to the current situation in which various clinical research teams report issues with model usability outside of research environments [4, 7, 12,13,14,15,16]. The use of faulty metrics and the lack of evaluation standards in the scientific community for assessing model performance on health-sensitive procedures pose a major threat to the quality and reliability of CDS systems.

In this work, we provide an overview of appropriate metrics, discuss interpretation biases, and propose a guideline for properly evaluating medical image segmentation performance in order to increase the reliability and reproducibility of research in this field.

Main text

Evaluation metrics

Evaluation of semantic segmentation can be quite complex because it requires measuring classification accuracy as well as localization correctness. The aim is to score the similarity between the predicted segmentation (prediction) and the annotated segmentation (ground truth). Over the last 30 years, a large variety of evaluation metrics has appeared in the MIS literature [10]. However, only a handful of scores have proven to be appropriate and are used in a standardized way [10]. This work demonstrates and discusses the behavior of the following common metrics for evaluation in MIS:

  • F-measure based metrics like Dice Similarity Coefficient (DSC) and Intersection-over-Union (IoU)

  • Sensitivity (Sens) and Specificity (Spec)

  • Accuracy / Rand Index (Acc)

  • Receiver Operating Characteristic (ROC) and the area under the ROC curve (AUC)

  • Cohen’s Kappa (Kap)

  • Average Hausdorff Distance (AHD)

Detailed descriptions of these metrics are presented in the Appendix, and a minimal implementation sketch of the confusion-matrix based scores is given below. The behavior of the metrics in this work is illustrated in Fig. 1 and Fig. 2, which demonstrate the metric application in multiple use cases.
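As a minimal illustration of the confusion-matrix based scores, the following Python sketch computes DSC, IoU, Sensitivity, Specificity, and Accuracy for a pair of binary masks. It is not part of any referenced framework; function and variable names are purely illustrative.

```python
import numpy as np

def confusion_matrix_scores(truth: np.ndarray, pred: np.ndarray) -> dict:
    """Confusion-matrix based metrics for two binary masks (0 = background, 1 = ROI)."""
    truth, pred = truth.astype(bool), pred.astype(bool)
    tp = np.sum(truth & pred)     # ROI pixels correctly predicted
    fp = np.sum(~truth & pred)    # background pixels predicted as ROI
    fn = np.sum(truth & ~pred)    # ROI pixels missed by the prediction
    tn = np.sum(~truth & ~pred)   # background pixels correctly predicted
    eps = 1e-12                   # avoids division by zero for empty masks
    return {
        "DSC":  2 * tp / (2 * tp + fp + fn + eps),
        "IoU":  tp / (tp + fp + fn + eps),
        "Sens": tp / (tp + fn + eps),
        "Spec": tn / (tn + fp + eps),
        "Acc":  (tp + tn) / (tp + tn + fp + fn + eps),
    }
```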

Fig. 1

Demonstration of metric behavior in the context of different-sized ROIs compared to the total image. The figure shows the advantages of F-measure based metrics like DSC and IoU and the inferiority of the Rand index. Furthermore, the small-ROI segmentation points out that metrics like Accuracy provide no interpretation value in these scenarios, whereas the large-ROI segmentation indicates that even a small percentage variance can carry the risk of missing whole ROI instances. The analysis was performed in the following scenarios and common MIS use cases. Scenarios: no segmentation (no pixel is annotated as ROI), full segmentation (all pixels are annotated as ROI), random segmentation (fully random annotation), untrained model (after 1 epoch of training), and trained model (fully fitted model). Use cases: small ROIs via brain tumor detection in magnetic resonance imaging and large ROIs via cell nuclei detection in pathology microscopy

Fig. 2

Demonstration of metric behavior for a trained segmentation model in the context of different medical imaging modalities. The figure shows the differences between distance-based metrics like AHD, metrics including true negatives like Accuracy, and metrics excluding true negatives like DSC. Each subplot is a violin plot visualizing the score distribution of all testing samples for the corresponding metric and modality. For visualization purposes, AHD was clipped to a maximum of 250 (affected number of samples per dataset: dermoscopy 2.0%, endoscopy 0.3%, fundus 0.0%, microscopy 0.0%, radiology 0.5%, and ultrasound 2.5%)

Class imbalance in medical image segmentation

Medical images are infamous in the field of image segmentation due to their extensive class imbalance [10, 17]. Usually, a medical image contains a single ROI that takes up only a small percentage of the pixels, whereas the remaining image is annotated as background. From a machine learning perspective, this means that the classifier must be trained on data composed of a very rare ROI class and a background class whose prevalence is often above 90% or even close to 100%. This extreme inequality in class distribution affects all aspects of a computer vision pipeline for MIS, from preprocessing, through the model architecture and training strategy, up to the performance evaluation [18].

In MIS evaluation, class imbalance significantly affects metrics that reward correct background classification. For metrics based on the confusion matrix, these cases are counted as true negatives. In a common medical image with a background-to-ROI class distribution of 9:1, the possible number of correct classifications is far higher for the background class than for the ROI. Using a metric that weights true positives and true negatives equally therefore yields a high score even if not a single pixel is classified as ROI, which significantly biases interpretation. This behavior can be seen in metrics like Accuracy or Specificity, which present deceptively high scores in almost any MIS context. Therefore, these metrics should be avoided for any interpretation of segmentation performance. Metrics that focus on true positive classification without rewarding true negatives provide a better representation of performance in a medical context, which is why DSC and IoU are highly popular and recommended in the field of MIS.
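The following toy example illustrates this bias: a hypothetical 512 × 512 image with roughly 10% ROI pixels and a degenerate model that predicts only background still reaches an Accuracy of about 0.90, while the DSC correctly drops to 0.

```python
import numpy as np

# Hypothetical ground truth: 512 x 512 image in which the ROI covers ~10% of pixels
truth = np.zeros((512, 512), dtype=np.uint8)
truth[:160, :160] = 1              # 25,600 of 262,144 pixels (~9.8%) are ROI

# Degenerate "model" that predicts background everywhere
pred = np.zeros_like(truth)

tp = np.sum((truth == 1) & (pred == 1))
tn = np.sum((truth == 0) & (pred == 0))
accuracy = (tp + tn) / truth.size                      # ~0.90 despite a useless model
dsc = 2 * tp / (np.sum(truth) + np.sum(pred) + 1e-12)  # 0.0, reflecting the failure
print(f"Accuracy: {accuracy:.2f}  DSC: {dsc:.2f}")
```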

Influence of the region-of-interest size on evaluation

The size of an ROI, and the resulting class imbalance ratio in an image, is inversely related to how robustly an evaluation can be interpreted. In the medical context, the ROI size is determined by the medical condition as well as the imaging modality. Various types of ROIs can be relevant for clinicians to segment. Whereas organ segmentation, cell detection, or a brain atlas take up a larger fraction of the image and thereby represent a more balanced background-ROI class ratio, the segmentation of abnormal medical features like lesions commonly reflects the strong class imbalance and can be characterized as more complex to evaluate. Furthermore, the imaging modality strongly influences the ratio between ROI and background. Modern high-resolution imaging like whole-slide images in histopathology provides resolutions of 0.25 μm per pixel with commonly 80,000 × 60,000 pixels [19, 20], in which an anaplastic (poorly differentiated) cell region takes up only a minimal part of the image. In such a scenario, the resulting background-ROI class ratio could typically be around 183:1 (estimated by a 5,120 × 5,120 pixel ROI in an 80,000 × 60,000 pixel slide). Another significant increase in class ratio can be observed in 3D imaging from radiology and neurology. Computed tomography or magnetic resonance imaging scans regularly provide image resolutions of 512 × 512 pixels with hundreds of slices (z-axis), resulting in a typical class ratio around 373:1 (estimated by a 52 × 52 × 52 voxel ROI in a 512 × 512 × 200 voxel scan) [19]. In order to avoid such extreme imbalance bias, distance-based metrics like AHD or metrics that exclude true negative rewarding like DSC are recommended. Besides that, patching techniques (splitting the slide or scan into multiple smaller images) are often applied to reduce complexity and class imbalance [2, 20].
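The stated class ratios follow from simple arithmetic over the image and ROI extents given above (both ROI extents are rough estimates):

```python
# Histopathology: whole-slide image vs. an anaplastic cell region
slide_px = 80_000 * 60_000          # 80,000 x 60,000 px slide
roi_px = 5_120 * 5_120              # ~5,120 x 5,120 px ROI
print(round(slide_px / roi_px))     # ~183 -> background-ROI ratio of roughly 183:1

# Radiology: 3D scan vs. a small volumetric ROI
scan_vx = 512 * 512 * 200           # 512 x 512 px resolution with 200 slices
roi_vx = 52 * 52 * 52               # ~52 x 52 x 52 voxel ROI
print(round(scan_vx / roi_vx))      # ~373 -> background-ROI ratio of roughly 373:1
```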

Influence of the segmentation task on evaluation

For valid interpretation of MIS performance, it is crucial to understand metric behaviors and expected scores in different segmentation tasks. Depending on the ROI type, such as lesion or organ segmentation, the complexity of the segmentation task and the resulting expected score vary significantly [21]. In organ segmentation, the ROI is located consistently at the same position with low spatial variance between samples, whereas an ROI in lesion segmentation shows high spatial as well as morphological variance. Thereby, near-optimal metric scores are achievable in organ segmentation but less realistic in lesion segmentation [22, 23]. This variance in complexity affects the expected evaluation scores and should be factored into performance interpretation. Another important influencing factor is the number of ROIs in an image. Multiple ROIs require additional attention during implementation and interpretation, because high overall metric scores can be misleading and hide undetected smaller ROIs among well-predicted larger ROIs, and because distance-based metrics are defined only on pairwise instance comparisons [21]. These risks should be considered in any evaluation involving multiple ROIs; one way to make them visible is a per-instance analysis as sketched below.
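The sketch below (illustrative, not part of any referenced framework) splits the ground truth into connected components with SciPy and reports how much of each annotated ROI is covered by the prediction, which exposes small ROIs hidden behind a good aggregate score:

```python
import numpy as np
from scipy import ndimage

def per_instance_coverage(truth: np.ndarray, pred: np.ndarray) -> list:
    """Fraction of each annotated ROI instance that is covered by the prediction.

    A high overall DSC can coexist with values near 0 here, exposing small ROIs
    that the model missed entirely.
    """
    truth = truth.astype(bool)
    pred = pred.astype(bool)
    labeled, n_instances = ndimage.label(truth)   # split ground truth into connected ROIs
    coverage = []
    for instance_id in range(1, n_instances + 1):
        instance = labeled == instance_id
        coverage.append(float(np.sum(pred & instance) / np.sum(instance)))
    return coverage
```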

Multi-class evaluation

The evaluation metrics discussed so far are all defined for binary segmentation problems. Be aware that applying binary metrics to multi-class problems can result in highly biased results, especially in the presence of class imbalance [6]. This can lead to confirmation bias and promising-looking evaluation results in scientific publications which are actually quite weak [6]. In order to evaluate multi-class tasks, the metrics have to be computed and analyzed individually for each class. Distinct evaluation for each class is in the majority of cases the most informative and comparable method. Nevertheless, it is often necessary to combine the individual class scores into a single value, for clarity or for further utilization, for example as a loss function. This can be achieved by micro- or macro-averaging the individual class scores: whereas macro-averaging computes the individual class metrics independently and simply averages the results, micro-averaging aggregates the contributions of all classes before computing a single score.
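As a minimal sketch of the two averaging strategies, the following illustrative functions compute per-class, macro-averaged, and micro-averaged DSC for integer-encoded masks (class 0 assumed to be the background):

```python
import numpy as np

def dsc_per_class(truth: np.ndarray, pred: np.ndarray, n_classes: int) -> np.ndarray:
    """Return one DSC per class for integer-encoded segmentation masks."""
    scores = np.empty(n_classes)
    for c in range(n_classes):
        tp = np.sum((truth == c) & (pred == c))
        scores[c] = 2 * tp / (np.sum(truth == c) + np.sum(pred == c) + 1e-12)
    return scores

def dsc_macro(truth, pred, n_classes):
    """Macro-average: compute each class independently, then take the mean."""
    return dsc_per_class(truth, pred, n_classes).mean()

def dsc_micro(truth, pred, n_classes):
    """Micro-average: pool all per-class contributions before dividing."""
    tp = sum(np.sum((truth == c) & (pred == c)) for c in range(n_classes))
    denominator = sum(np.sum(truth == c) + np.sum(pred == c) for c in range(n_classes))
    return 2 * tp / (denominator + 1e-12)
```

Note that for masks with exactly one label per pixel, micro-averaging over all classes including the background collapses to pixel accuracy, which illustrates how background inclusion can inflate a combined score.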

Evaluation guideline

  • Use DSC as the main metric for validation and performance interpretation.

  • Use AHD for interpretation of point position sensitivity (contour) if needed.

  • Watch out for class imbalance and avoid interpretations based on high Accuracy.

  • In addition to DSC, also report IoU, Sensitivity, and Specificity for method comparability.

  • Provide sample visualizations, comparing the annotated and predicted segmentation, for visual evaluation as well as to avoid statistical bias.

  • Avoid cherry-picking high-scoring samples.

  • Provide histograms or box plots showing the score distribution across the dataset (see the plotting sketch after this list).

  • Keep in mind variable metric outcomes for different segmentation types.

  • Be aware of interpretation risks caused by multiple ROIs.

  • For multi-class problems, provide metric computations for each class individually.

  • Avoid confirmation bias from macro-averaging across classes, which inflates scores by including the background class.

  • Provide access to evaluation scripts and results with journal data services or third-party services like GitHub [24] and Zenodo [25] for easier reproducibility.
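As an illustrative sketch for the distribution recommendation above, per-sample DSC values (hypothetical numbers below) can be summarized in a box plot with Matplotlib:

```python
import matplotlib.pyplot as plt

# Hypothetical per-sample DSC values, e.g. one list per evaluated model or dataset
scores = {
    "Model A": [0.91, 0.88, 0.47, 0.93, 0.85],
    "Model B": [0.79, 0.81, 0.74, 0.83, 0.77],
}

fig, ax = plt.subplots(figsize=(5, 3))
ax.boxplot(list(scores.values()), labels=list(scores.keys()))  # one box per distribution
ax.set_ylabel("DSC")
ax.set_ylim(0, 1)
fig.savefig("dsc_distribution.png", dpi=300, bbox_inches="tight")
```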

Sample visualization

Besides the exact performance evaluation via metrics, it is strongly recommended to additionally visualize segmentation results. Comparing annotated and predicted segmentation allows a robust performance estimation by eye. Sample visualization can be achieved via binary visualization of each class (black and white) or by overlaying the original image with transparent colors based on the pixel classes. The strongest advantage of sample visualization is that it avoids statistical bias, i.e., overestimation of predictive power through unsuitable or incorrectly computed metrics.
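A minimal sketch of such a transparent overlay, assuming a grayscale image and binary masks as NumPy arrays (function name and color choices are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_overlay(image: np.ndarray, truth: np.ndarray, pred: np.ndarray, path: str):
    """Show annotation (green) and prediction (red) as transparent overlays."""
    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    panels = [("Annotation", truth, "Greens"), ("Prediction", pred, "Reds")]
    for ax, (title, mask, cmap) in zip(axes, panels):
        ax.imshow(image, cmap="gray")
        # Hide background pixels so only the ROI is tinted on top of the image
        overlay = np.ma.masked_where(mask == 0, mask)
        ax.imshow(overlay, cmap=cmap, vmin=0, vmax=1, alpha=0.4)
        ax.set_title(title)
        ax.axis("off")
    fig.savefig(path, dpi=300, bbox_inches="tight")
    plt.close(fig)
```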

Experiments

We conducted multiple experiments to support the principles of our evaluation guideline as well as to demonstrate metric behaviors on various medical imaging modalities. Furthermore, the insights of this comment are based on our experience during the development and application of the popular framework MIScnn [18] as well as our contributions to currently running or already published clinical studies [2, 26,27,28].

The analysis utilized our medical image segmentation framework MIScnn [18] and was performed with the following parameters: sampling into 64% training, 16% validation, and 20% testing sets; resizing to 512 × 512 pixel images; intensity value normalization via Z-score; extensive online image augmentation during training; a common U-Net architecture [29] as neural network with the focal Tversky loss function [30] and a batch size of 24 samples; and advanced training features like dynamic learning rate, early stopping, and model checkpoints. The training was performed for a maximum of 1000 epochs (68 up to 173 epochs after early stopping) and on 50 up to 75 randomly selected images per epoch. For metric computation and evaluation, we utilized our framework MISeval, which provides implementations of and an open interface for all discussed evaluation metrics in a Python environment [31]. In order to cover a large spectrum of medical imaging with our experiments, we integrated datasets from various medical fields: radiology (brain tumor detection in magnetic resonance imaging from Cheng et al. [32, 33]), ultrasound (breast cancer detection in ultrasound images [34]), microscopy (cell nuclei detection in histopathology from Caicedo et al. [35]), endoscopy (endoscopic colonoscopy frames for polyp detection [36]), fundus photography (vessel extraction in retinal images [37]), and dermoscopy (skin lesion segmentation for melanoma detection in dermoscopy images [38]).
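For illustration, metric computation with MISeval could look roughly like the following sketch; the exact function name, metric identifiers, and argument order should be taken from the MISeval documentation [31] and are treated here as assumptions:

```python
import numpy as np
from miseval import evaluate  # interface described in [31]; exact signature assumed

# Hypothetical binary masks (0 = background, 1 = ROI)
ground_truth = np.random.randint(0, 2, size=(512, 512))
prediction = np.random.randint(0, 2, size=(512, 512))

# Metric identifier strings ("DSC", "IoU") are assumptions based on [31]
dsc = evaluate(ground_truth, prediction, metric="DSC")
iou = evaluate(ground_truth, prediction, metric="IoU")
print(f"DSC: {dsc:.3f}  IoU: {iou:.3f}")
```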

Outlook

This work focused on defining metrics, their recommended usage, and interpretation biases in order to establish a standardized medical image segmentation evaluation procedure. We hope that our guidelines will help improve evaluation quality, reproducibility, and comparability of future studies in the field. Furthermore, we noticed that there is no universal Python package for metric computation, which is why we are currently working on a package to compute metric scores in a standardized way. In the future, we want to further contribute to and expand our guidelines for reliable medical image segmentation evaluation.