Introduction

Medical imaging plays a central role in medicine today because it can reveal the anatomy of the patient. However, to leverage the full potential of medical images, it is necessary to analyze them via image processing. One of the main clinical tools is image segmentation1, 2. Medical image segmentation can be defined as an automatic (or semiautomatic) process to detect boundaries within a 2D or 3D image, based on information such as pixel intensity, texture and anatomical knowledge. The result of segmentation can then be used in further applications and in gaining insights2; examples include the quantification of tissue volumes3, 4, diagnosis5, 6, the localization of pathology7, 8, the study of anatomical structure9, 10, treatment planning11, and computer-integrated surgery12.

Manual medical image segmentation raises two main issues: the time needed for delineation and the reproducibility of the results. First, the time needed to segment cannot be reduced below a certain point, and it grows with the number and the size of the images. Since both are increasing due to easier access to medical imaging facilities and improvements in acquisition technologies, manual segmentation is becoming intractable. Second, reproducibility corresponds to the agreement between the results of multiple measurements of the data (here, the segmentation results) under the same methodology. In medical image segmentation, it is well known that there is inter- and intraoperator variability. The former relates to the observed differences in the segmentation results obtained by two different operators, while the latter relates to the observed differences between two segmentations performed by the same operator at two different times. Given the crucial role of segmentation in medical diagnosis and treatment, the reproducibility of the method is fundamentally important.

These two issues lead one to consider automatic segmentation. Automatic segmentation consists in determining a prediction model and its inherent parameters relative to a given class of problems (for example, the kind of imaging performed or the organs imaged). These parameters can be divided into two classes: the hyperparameters associated with the model and the parameters estimated from the dataset. The aim of automatic segmentation is to estimate the parameters that yield highly accurate results on the training dataset while maintaining good generalization to other datasets of the same class of problem, also called “test datasets”. In other words, the algorithm must avoid fitting the training set perfectly while performing poorly on the test set. This problematic phenomenon is called “overfitting” (see page 108 of Goodfellow et al.13).

The rapid development of new automatic segmentation algorithms since the 2000s is strongly connected to the rise of machine learning2. During the last decade, a specific field of machine learning and artificial neural networks, called “deep learning” (DL)13, 14, has outperformed classical segmentation methods15. A neural network with several hidden layers is considered a ‘deep’ neural network, hence the term ‘deep learning’14. Several factors explain this success, for example, unsupervised feature extraction via convolutional layers and the ability to handle very large datasets via efficient optimization methods such as backpropagation of the gradient (see chapter 6.5 of the book by Goodfellow et al.13). Several DL architectures have been applied to medical image segmentation, including fully convolutional networks (FCNs)16 and U-Net17 (see Litjens et al.15 for a recent review). FCNs16 are built from locally connected layers, such as convolution, pooling and upsampling layers. An FCN is composed of two main parts: a downsampling path and an upsampling path. The downsampling path captures contextual information, whereas the upsampling path recovers spatial information. Moreover, skip connections between layers are used to recover fine-grained spatial information that is potentially lost in the pooling and downsampling layers. U-Net17 is built upon FCNs. The main difference is that each downsampling scale is linked to the corresponding upsampling scale with a concatenation operator. In this way, each upsampling scale receives the information of the corresponding downsampling scale and of the lower upsampling scale, leading to better segmentation.
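For illustration, the following minimal sketch (in PyTorch; a hypothetical toy network, not the exact architecture of Ronneberger et al.17) shows the downsampling path, the upsampling path and the concatenation skip connection that characterize U-Net-like models.

```python
# Minimal two-scale U-Net-style sketch (hypothetical, for illustration only).
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_channels=1, n_classes=2):
        super().__init__()
        self.down = nn.Sequential(                      # downsampling path: context
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
        )
        self.pool = nn.MaxPool2d(2)
        self.bottom = nn.Sequential(
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # upsampling path: localization
        self.fuse = nn.Sequential(                         # applied after the skip concatenation
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, n_classes, 1),
        )

    def forward(self, x):
        skip = self.down(x)               # fine-grained features kept for the skip connection
        x = self.bottom(self.pool(skip))  # coarse, contextual features
        x = self.up(x)                    # recover spatial resolution
        x = torch.cat([skip, x], dim=1)   # U-Net-style concatenation skip connection
        return self.fuse(x)

# Example: a batch of four 64x64 single-channel images -> per-pixel class scores.
logits = TinyUNet()(torch.randn(4, 1, 64, 64))
print(logits.shape)  # torch.Size([4, 2, 64, 64])
```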

However, although DL algorithms perform well, they are complex. A number of factors may explain the variability in the results obtained: the intrinsic variability of the dataset, the stochastic processes involved in optimization, the choice of the hyperparameters governing the optimization and regularization processes, and the choice of the DL architecture itself. This variability in the different parts of the framework makes it difficult to analyze reproducibility and to compare frameworks. In addition, it leads to numerous parameters and hyperparameters that must be set. Furthermore, as highlighted in Joelle Pineau’s reproducibility checklist18, presented during NeurIPS 2019, describing the DL methods becomes a challenge for reproducibility in its own right. Moreover, the strategy for evaluating the segmentation results, and thereby the variability of the method, is complex: there is a plethora of metrics19 for analyzing segmentations, leading to various ways of comparing the methods.

Along these lines, three main questions, at least, about variability and reproducibility can be formulated.

  • Question 1: Is there enough information in published articles in the field of medical image segmentation with DL to correctly reproduce the results?

  • Question 2: If the information is provided, has the variability in the several steps of the DL framework been considered?

  • Question 3: Does the evaluation system for the segmentation results correctly reflect this variability?

These three questions are crucial for the application and potentially the evaluation of the segmentation algorithms. After focusing on the concept of reproducibility in medical image segmentation and on how to consider the different sources of variability in DL, we will review the literature to provide an overview of the practice of reproducibility in the fields of medical image segmentation in DL, based on three main topics: (1) the description of the methods, (2) the analysis of variability and (3) the evaluation system. On the basis of this synthesis, we will propose recommendations to appreciate the results of new DL strategies.

Related work

In this section, we will broadly address the issues of the reproducibility and evaluation of segmentation in medical imaging. Then, we will outline several sources of variability in the DL framework that can lead to difficulties for reproducibility.

Reproducibility and evaluation of segmentation in medical imaging

Reproducibility is a prominent topic in science20. Numerous articles21, 22 point to a potential reproducibility crisis across scientific fields: most scientists have experienced a failure to reproduce results21 (more than 50% in the case of their own work in medicine, physics and engineering, and more than 75% in the case of work by another person in the same fields).

In the rest of the article, we follow the definition of the report of the National Academies of Sciences, Engineering, and Medicine20: reproducibility means obtaining consistent results using the same input data, computational steps, methods, and conditions of analysis; it is synonymous with computational reproducibility. Moreover, this report20 (recommendation 5-1, page 7) recommends that researchers provide an accurate and appropriate characterization of relevant uncertainties when they report or publish their research. These uncertainties include stochastic uncertainties.

Reproducibility can be assessed with different procedures. First, it can be analyzed with the intraclass correlation (ICC) proposed by Shrout and Fleiss23. The resulting score lies between 0 and 1, indicating poor and perfect reproducibility, respectively, and enables a comparison between intra-individual and inter-individual variability. Another statistical tool, which generalizes the ICC, is analysis of variance (ANOVA)24. It provides a collection of tools focusing on the variability of the means among groups. One interesting point is that ANOVA can deal with multiple factors.
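As an illustration, the following sketch (assuming hypothetical Dice scores for n subjects rated by k operators) computes the one-way random-effects ICC(1,1) of Shrout and Fleiss and a one-way ANOVA with SciPy.

```python
# ICC(1,1) from a one-way random-effects model, plus a one-way ANOVA with SciPy,
# on hypothetical Dice scores (n subjects rated by k operators).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 10, 3                                          # 10 subjects, 3 operators (hypothetical)
truth = rng.normal(0.85, 0.05, size=(n, 1))           # subject-level "true" score
scores = truth + rng.normal(0.0, 0.02, size=(n, k))   # operator noise added on top

grand = scores.mean()
msb = k * ((scores.mean(axis=1) - grand) ** 2).sum() / (n - 1)                    # between subjects
msw = ((scores - scores.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))  # within subjects
icc_1_1 = (msb - msw) / (msb + (k - 1) * msw)
print(f"ICC(1,1) = {icc_1_1:.3f}")                    # close to 1 -> good reproducibility

# One-way ANOVA testing whether the operator (column) means differ.
f_stat, p_value = stats.f_oneway(*[scores[:, j] for j in range(k)])
print(f"ANOVA across operators: F = {f_stat:.2f}, p = {p_value:.3f}")
```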

One of the main sources of variability in machine learning originates from the difference between the observed samples of the dataset and the true distribution of the data. The fact that the learning step of the algorithm is performed on only part of the distribution can affect the reproducibility and particularly the replicability of the results. A class of tools called “cross-validation” (CV)25 is available for studying this variability. These methods receive special attention in the next section, concerning variability in the dataset.

Moreover, segmentation in the specific field of medical imaging is complex in terms of reproducibility for several reasons. First, the available datasets are generally limited: the number of samples is typically fewer than 100. Then, each segmentation task must be considered with regard to the image modality (for example, MRI, CT scanner or ultrasound) and the organ studied26. Furthermore, the masks used in segmentation are usually generated manually, which introduces intra- and inter-rater variability. Consequently, there is no absolute ground truth but only a gold standard. Additionally, there are several metrics for evaluating segmentation, such as the Dice coefficient (DC) and the modified Hausdorff distance, and each metric focuses on a specific aspect of the segmentation19. For example, a metric can correctly reflect the overlap between a segmentation mask and a gold standard, but it cannot highlight the smoothness of the contour. To correctly describe the quality of a segmentation, several metrics are necessary19, 26. An adequate evaluation system allows the variability of DL frameworks to be properly taken into account.

Variability in DL frameworks

In the next sections, five different kinds of variability are presented. The DL framework and its related sources of variability are displayed in Fig. 1.

Figure 1

The different steps of a DL framework are displayed in solid-line boxes: the steps related to the dataset (the data augmentation and cross-validation strategies), the DL architecture design, the training step (with the optimization procedure), the estimation of the hyperparameters of the optimization, and the evaluation system. The different sources of variability are highlighted in dashed-line boxes: the variability linked to (A) the dataset, (B) the DL architecture, (C) the optimization procedure, (D) the hyperparameter estimation for the optimization and (E) the implementation and infrastructure.

Variability in the dataset

To infer a segmentation with a DL model (and, more generally, a supervised machine learning model), the classic method consists in splitting the dataset into three parts. The first part is the “training set”, used to estimate the parameters of the model: it is composed of the raw data and corresponding labels. Based on the raw data, the DL algorithm infers results that are compared to the labels, and the DL parameters are then optimized to minimize the error between the results and the labels. The second part is the “validation set”, which is more specific to the DL community. It estimates the unbiased error of the trained DL model and allows training to be stopped early to avoid overfitting. It is not mandatory and is used in practice when the dataset has enough samples. Finally, the last part, called the “testing dataset”, provides an unbiased evaluation of the final DL model. The proportions of the different parts depend on the initial number of samples and can significantly affect the expected degree of generalization. Consider a trivial example where only one sample is chosen for the testing set: the evaluation of the DL model then depends greatly on the selected sample. In the same way, selecting few samples for the training set leads the model to learn the training data perfectly without generalizing.

To avoid bias in the data selection, strategies called “cross-validation” are performed. These strategies consist in dividing the dataset into several folds and assigning these folds to the training, validation and testing sets. At the end of the DL model estimation and evaluation processes, the folds are reassigned for new estimations, and so on. Cross-validation strategies make it possible to address variability in the data, as sketched below.
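A minimal sketch of such a split, using scikit-learn and a hypothetical dataset of 100 images, is given below; the 5 folds and the 20% validation subset are arbitrary choices.

```python
# k-fold cross-validation split with a validation subset inside each training fold.
import numpy as np
from sklearn.model_selection import KFold, train_test_split

images = np.random.rand(100, 64, 64)                # 100 hypothetical 2D images
masks = np.random.randint(0, 2, size=(100, 64, 64)) # corresponding hypothetical masks

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(images)):
    # Keep a validation subset inside the training fold to monitor overfitting.
    train_idx, val_idx = train_test_split(train_idx, test_size=0.2, random_state=0)
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val / {len(test_idx)} test")
    # model = build_model(); model.fit(images[train_idx], masks[train_idx], ...)
```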

The number of parameters to estimate in a DL model is often larger than the number of images in the dataset. Moreover, in medical image segmentation, the heterogeneous appearance of the target organ (anatomical variability) or of the lesions (size, shape or position) poses a great challenge. One solution, called “data augmentation”27, generates new samples by applying different transformations to the dataset (e.g., rotation or flipping). In this way, unseen target organs or lesions can potentially be approximated. However, this also adds a source of variability to the general framework, since there is no consensus on which transformations to perform, and the parameters of the transformations are generally chosen randomly.
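The sketch below illustrates simple geometric data augmentation on a hypothetical 2D image/mask pair; the chosen transformations (flips and 90-degree rotations) are only examples, and they must be applied identically to the image and to its mask.

```python
# Simple geometric data augmentation for a 2D image/mask pair (illustrative choices).
import numpy as np

def augment(image: np.ndarray, mask: np.ndarray, rng: np.random.Generator):
    if rng.random() < 0.5:                        # random horizontal flip
        image, mask = np.fliplr(image), np.fliplr(mask)
    if rng.random() < 0.5:                        # random vertical flip
        image, mask = np.flipud(image), np.flipud(mask)
    k = rng.integers(0, 4)                        # random 90-degree rotation
    return np.rot90(image, k), np.rot90(mask, k)

rng = np.random.default_rng(42)
image = np.random.rand(64, 64)                    # hypothetical image
mask = (image > 0.5).astype(np.uint8)             # hypothetical mask
aug_image, aug_mask = augment(image, mask, rng)
```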

Variability in the optimization

This section focuses specifically on the variability of the optimization for an already estimated and fixed set of hyperparameters. One of the main sources of complexity is the very large number of model parameters to be estimated: solving the optimization problem of estimating these weights is generally an extremely difficult task that relies on a stochastic process.

Each weight in the DL algorithm corresponds to a parameter (which can be seen as a dimension) of the cost function of the optimization. DL models often have millions of parameters, making the search space to be explored by the algorithm extremely high-dimensional, in contrast to classic machine learning algorithms. Moreover, each added dimension dramatically increases the distance between points in this high-dimensional space, so the search space grows drastically. More precisely, the number of possible distinct configurations of a set of parameters increases exponentially with the number of parameters. This is often referred to as the “curse of dimensionality” (see page 155 of Goodfellow et al.13).

In addition, the cost function is generally nonconvex (see page 282 of Goodfellow et al.13). These facts lead to several issues: the presence of local minima and flat regions, compounded by the high dimensionality of the search space. The best general algorithm known for solving this problem is stochastic gradient descent (SGD) (see chapter 5.9 of Goodfellow et al.13), where the model weights are updated at each iteration using the backpropagation-of-error algorithm. However, there is no guarantee that the DL estimation will converge to a good solution (or even a good local optimum), that the convergence will be fast, or that convergence will occur at all28.
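The following toy sketch illustrates the SGD update rule with momentum on a hypothetical least-squares problem; in a DL framework the gradient would instead be computed by backpropagation over a nonconvex cost, and the learning rate, momentum and batch size shown here are arbitrary.

```python
# SGD with momentum on a toy least-squares problem (illustrative hyperparameters).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=256)

w = np.zeros(5)
velocity = np.zeros(5)
lr, momentum, batch_size = 0.1, 0.9, 32

for step in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # random mini-batch
    grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch_size  # gradient of the MSE loss
    velocity = momentum * velocity - lr * grad                # momentum update
    w = w + velocity                                          # parameter update

print(np.linalg.norm(w - w_true))  # the residual should be small on this toy problem
```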

Nevertheless, recent work suggests that local minima and flat regions may be less challenging than previously believed29,30,31. According to Choromanska et al.29, almost all local minima have function values very similar to that of the global optimum, so finding a local minimum is sufficient. These results were obtained for classification tasks. Furthermore, the convolutional step, which is important for segmentation, is not considered in Choromanska et al.29 or Dauphin et al.30.

To the best of our knowledge, only one conference article32 addresses this issue of stochastic optimization uncertainties in medical image segmentation with DL. The authors show that DL models estimated several times with the same data differ, but the results obtained on the evaluated metrics are not significantly different.

Variability in the hyperparameters

The hyperparameters correspond to the global settings of an algorithm; in machine learning, each of them impacts the results differently33. Several hyperparameters must be set before training the DL model, for example, the learning rate for optimization and the dropout percentage for regularization13.

There are different ways to set them. The first is manual configuration. This strategy limits the exploration space, but the computation time is relatively short compared to those of other methods, since only a rough approximation of the best hyperparameters is expected. The second kind of strategy is based on automatic exploration of the search space. The classic method, called “grid search”, tests every combination of hyperparameters. It will find the best set of hyperparameters within the grid, but the computational cost increases greatly with the number of hyperparameters. Another strategy, called “random search”, randomly samples the sets of hyperparameters to be evaluated. This method generally cannot reach the optimal values, but it approximates them in fewer iterations than grid search.

A more recent strategy34, called “Bayesian optimization”, automatically infers a new combination of hyperparameters from the previous evaluations. In this case, the space exploration is intermediate and driven by experience, and its cost is lower than that of a grid or random search.
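The sketch below contrasts grid search and random search over two hyperparameters; the `evaluate` function is a hypothetical stand-in for training a model and returning a validation score.

```python
# Grid search versus random search over two hyperparameters (synthetic objective).
import itertools
import random

def evaluate(learning_rate, dropout):
    # Stand-in for "train the model, return validation Dice"; purely synthetic.
    return 1.0 - abs(learning_rate - 1e-3) * 100 - abs(dropout - 0.3)

# Grid search: every combination of the predefined values is evaluated.
lr_grid = [1e-2, 1e-3, 1e-4]
dropout_grid = [0.2, 0.4, 0.6]
best_grid = max(itertools.product(lr_grid, dropout_grid),
                key=lambda hp: evaluate(*hp))

# Random search: the same budget (9 trials) spent on randomly sampled values.
random.seed(0)
candidates = [(10 ** random.uniform(-4, -2), random.uniform(0.1, 0.7))
              for _ in range(9)]
best_random = max(candidates, key=lambda hp: evaluate(*hp))

print("grid search best  :", best_grid)
print("random search best:", best_random)
```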

Variability in the DL architecture

Here, only the number of nodes, the number of layers, the kinds of layers (for example, convolutional, pooling or dense) and the connections among the layers are considered as defining the architecture. Even with only these four factors, the number of possible architectures is infinite.

In practice, three main strategies are used to select the architecture. The first consists in selecting a well-known DL model that has already proved its performance in previous work15, such as U-Net17 for image segmentation. This strategy is most common in clinical application fields, but it is not expected to provide the best architecture for a specific problem.

Another strategy consists in manually handcrafting the DL architecture, which has led to a plethora of architectures15. However, it does not guarantee the best architecture, and modifications of the tested architecture are generally not explored. The final strategy, called “network architecture search”, is to automatically create a DL architecture through optimization for a specific task35. The drawback of approximating the best architecture in this way is a very high cost in time and resources: for instance, the network architecture search proposed in36 tested 20,000 architectures in 4 days with 500 graphics processing units (GPUs).

The estimation of the minimal network architecture needed to achieve a certain segmentation accuracy on a given dataset can enable variability in the DL architecture to be avoided. However, as discussed in the review37, this topic remains a challenge.

Variability in the middleware and the infrastructure

The previous sections focused on algorithmic sources of variability in DL. In this section, the possible variability due to the middleware and the infrastructure is considered. There are many toolboxes for implementing a DL framework. To the best of our knowledge, no publication has addressed the problem of reproducibility in DL with regard to the middleware. Implementations differ, for example, in their programming language and in their capacity to use a GPU. A review of different implementations and their characteristics can be found in38.

The learning phase in DL can be a very long process, given the complexity of the DL architecture and the dataset size. As previously explained, the search for hyperparameters can also be prohibitive. To improve the processing time, several solutions based on the infrastructure are considered. Different kinds of infrastructure39 can be used, such as a central processing unit (CPU), a GPU or a tensor processing unit (TPU). However, some technical characteristics, such as the memory precision for different memory sizes, can affect the accuracy of the results40. As another example, numerical operations performed on the GPU can be nondeterministic, leading to nonreproducible results41.
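As an illustration, the following sketch gathers the seeding and determinism settings commonly used with PyTorch to limit such nondeterminism (the flag names assume a recent PyTorch version); even then, bitwise reproducibility across hardware or library versions is not guaranteed.

```python
# Seeding and determinism settings to limit GPU nondeterminism with PyTorch.
import os
import random
import numpy as np
import torch

def set_deterministic(seed: int = 0) -> None:
    random.seed(seed)                          # Python RNG
    np.random.seed(seed)                       # NumPy RNG
    torch.manual_seed(seed)                    # CPU and CUDA RNGs
    torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable nondeterministic autotuning
    # Some CUDA operations additionally require this workspace setting.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)   # raise an error on nondeterministic ops

set_deterministic(42)
```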

Another possibility for accelerating the processing time is to choose a parallel or distributed DL model. These techniques come with their own methods that potentially impact the reproducibility of the outcome. For an overview of parallel and distributed models and their challenges, the interested reader can refer to42, 43.

Methods

In this section, we first introduce how the literature review was performed, and then, we briefly describe the different metrics.

Literature review

There is no standard for the reproducibility or evaluation of DL in medical image segmentation. The aim of this review is to reflect common practices for DL in medical image segmentation. To fulfill this expectation, this review focuses on three goals: (1) to inspect how the methods are described to enable work to be reproduced, (2) to present the variety of methodology and highlight the variability among DL frameworks and (3) to outline the kinds of evaluations used in DL.

To observe the variability of the methodology and evaluations in the literature, we focus on the 23 articles presented in the specific section “Tissue/anatomy/lesion/tumor segmentation” of the review article15. This review was chosen because it was the most relevant result on Google Scholar (with the mandatory keywords ‘medical image segmentation neural network’ and at least one of the keywords ‘review’ or ‘survey’) among more than 2300 hits (in December 2019). All the considered articles propose recent strategies: the oldest was published in 201444, and the mean year of publication is 2016. Moreover, the mean number of citations on Google Scholar (in December 2019) is \(232.3 \pm 308.2\) (median = 97, min = 20, max = 1074).

To obtain a more recent overview, we select 3 reviews of medical image segmentation methods37, 45, 46. We focus specifically on how the problem of variability and reproducibility is addressed in the scientific literature.

We focus on the possible variability introduced by the data itself, by the optimization strategy and its associated hyperparameters, by the middleware and the infrastructure, and by the evaluation measures. For each inspected parameter or evaluation, we determine whether the corresponding term is present and, if so, what value is reported. This information is essential for reproducing the different works. When a framework is described, we also determine whether the terms are used appropriately; to illustrate this point, we consider the kind of algorithm used in the optimization strategy.

For the data variability, we consider whether the DL algorithm is tested on several datasets, whether they are public or private, the number of data points available, whether data augmentation has been performed, the proportions of the training, validation and testing sets, and the possible application of a cross-validation method. For the optimization, we examine whether different parameters are recorded (the optimization strategy, learning rate, batch size, and presence of dropout regularization). We also investigate whether the hyperparameters of the optimization are handcrafted or automatically optimized (and whether this information is available). For the middleware and infrastructure considerations, we report whether these details are provided. Special attention is also paid to the implementation of the DL model and the processing unit considered. We also determine whether the calculations are performed on a distributed system, which can itself be a large source of variability. For the evaluation, we consider the number and kinds of measures, and whether the variability of the results is described (the presence of standard deviations).

Metric evaluation

The different estimations of the DL models are evaluated with the DC, the true positive rate (TPR), also called the sensitivity (Sens.), the true negative rate (TNR), also called the specificity (Spef.), and the average volume distance (AVD), which is linked to the Hausdorff distance. We chose these metrics because they appear frequently in the articles of the literature review. The different metrics are described in Table 1 (following Taha et al.19). We consider various metrics, since each metric has its drawbacks and evaluates only a part of the segmentation problem19, 26. Readers interested in additional metrics and the interactions among them can read the study of Taha et al.19.

Table 1 Segmentation metrics
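For reference, a minimal NumPy implementation of three of these metrics (DC, TPR/sensitivity and TNR/specificity) for binary masks could look as follows; the two masks below are hypothetical.

```python
# Overlap metrics for binary segmentation masks (DC, TPR, TNR).
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    tp = np.logical_and(pred, gt).sum()
    return 2.0 * tp / (pred.sum() + gt.sum())

def sensitivity(pred: np.ndarray, gt: np.ndarray) -> float:   # TPR
    tp = np.logical_and(pred, gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return tp / (tp + fn)

def specificity(pred: np.ndarray, gt: np.ndarray) -> float:   # TNR
    tn = np.logical_and(~pred, ~gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    return tn / (tn + fp)

# Hypothetical gold standard and predicted mask.
gt = np.zeros((64, 64), dtype=bool)
gt[16:48, 16:48] = True
pred = np.zeros((64, 64), dtype=bool)
pred[20:52, 16:48] = True
print(dice(pred, gt), sensitivity(pred, gt), specificity(pred, gt))
```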

Synthesis of the literature review

The main results are displayed in Tables 2, 3, 4 and 5. Table 2 focuses on the data variability. Table 3 focuses on the evaluation procedure. Table 4 presents the optimization strategies. Table 5 considers the middleware and the infrastructure. We are interested in the following three main points: (1) whether the DL strategy is correctly described as enabling the work to be reproduced, (2) whether the variability of the different parts of the DL framework are considered, and (3) how the evaluation is performed and the results are reported.

Description of the DL strategy

In this section, we focus not on whether some methods are performed, but on whether the methods are clearly described; a method may well have been applied without any mention in the text.

The main findings are as follows: only two articles47, 48 (9% of the articles) describe the hyperparameters and the dataset in sufficient detail to enable the work to be reproduced. One study49 has just one hyperparameter missing (the batch size) in the text, but the source code is available with this information included. Here, we focus on descriptions relative to the dataset and to the optimization stage. These results are detailed in Fig. 2. The left side of the figure pertains to the description of the dataset (the training proportion, the data augmentation and the validation set) and the right side to the description of the optimization (the optimization procedure, the learning rate, the dropout procedure and the batch size). Some criteria are described well, such as the training proportion (83% of the considered articles) or the optimization procedure (83% of the selected articles). However, some characteristics are less often reported, such as the data augmentation procedure (only 35% of the articles). To obtain a reproducible study, all these characteristics must be described. Only 9% of the selected articles provide sufficient information to be reproducible.

In Table 2, the dataset management is described. All the selected articles correctly present the dataset and the number of samples, but 17% do not state the training proportion used to estimate the parameters. Only 57% of the selected articles clearly state whether they used a validation set, and only 35% whether they performed data augmentation.

Table 4 focuses on the hyperparameters of the optimization process. 17% of the articles do not describe the optimization procedure at all, and one44 cites only a generic name (GDM, for gradient descent method) without any explanation. The learning rate is generally reported with its initial value (or range of values), but four articles do not mention the values of this parameter. For the specific AdaDelta optimization used in50,51,52, there is no learning rate; however, some coefficients need to be specified, such as the sensitivity ratio, and only one article52 of the three mentions this coefficient. More than half of the selected studies (52%) do not mention the batch size, and only 35% of all the articles specify its value. The dropout method, which relates to regularization, is present in 61% of the selected articles (only 43% specify the dropout ratio). 43% of the selected articles state that they perform stochastic gradient descent (SGD). However, in a strict sense, SGD is a generic term, and 90% of the selected articles use SGD with momentum. Moreover, SGD is often confused with mini-batch gradient descent methods53, which is the case for 70% of the selected articles, which use the term batch size alongside the term SGD.

In Table 5, it can be seen that 35% of the selected articles do not describe the toolbox used to implement the DL models, and 26% do not specify the kind of infrastructure. Assuming that a correct description of a GPU requires at least the name of the manufacturer, the class and the memory size, only 30% provide this information. It can also be observed in Table 5 that there is no convention for reporting the infrastructure.

Figure 2

The left side of the figure pertains to the description of the dataset and the right side to the description of the optimization. The training proportion is described in 83% of the selected articles; data augmentation and the validation set are described in 35% and 57% of the selected articles, respectively. For the optimization procedure, the name of the optimization algorithm is missing in 17% of the selected articles. The values of the learning rate, dropout and batch size hyperparameters are available in only 57%, 52% and 35% of the articles, respectively; these coefficients are mentioned in the text without any value in 19%, 9% and 13% of the articles, respectively. In the end, only 9% of the evaluated articles provide enough information to be reproducible.

The best way to reproduce an algorithm and to explore the hyperparameters or the architecture of a DL model is to have access to the source code. In Table 5, we observe that only 17% of the articles release the source code. These articles47,48,49 are the same as those that provide an exhaustive description of the framework for reproducibility.

Variability in DL frameworks

In the selected articles, we are interested in the variability in the dataset, the optimization, the hyperparameters, the architecture of the DL framework, the implementation and the infrastructure. The main results of the next sections are illustrated in Fig. 3, which is divided into four parts describing the variability of the dataset size, the cross-validation strategies, the optimization algorithms and the implementation. The main conclusion is that there is no consensus on these topics. The rest of the results are discussed in detail in the subsequent sections.

Variability in the dataset

In Table 3, the results focus on data variability. More than half of the methods are evaluated on more than one dataset and on publicly available datasets (in general, provided by data challenges such as BRATS54).

30% of the articles test their algorithms only on private datasets.

Only 6 datasets have more than 100 samples, and of these 6, 4 come from the same public source, BRATS. This highlights the difficulty of obtaining large datasets; consequently, data augmentation is important for medical image segmentation. Since the segmentation of a voxel can be performed locally, data augmentation based on patches can be considered. However, 13% of the articles do not clearly describe whether data augmentation is performed, whether a patch strategy is considered, or how many patches are selected. The training proportion and the CV strategies make it possible to avoid or limit bias relative to the chosen dataset, yet 52% of the articles do not use any CV strategy.

Variability in the optimization

One article47 presents an original strategy for managing the intrinsic variability of the optimization stage of the DL: the results of 3 DL models are merged, leading to better results than any single model alone. The other 22 articles do not discuss this notion.

Variability in the hyperparameters

We can observe in Table 4 that only one article55 clearly explains the tuning of the hyperparameters, using a grid search strategy. Another article56 claims to automatically tune the hyperparameters without any explanation. In the articles considered in Table 4, there are three main strategies: SGD with momentum, RMSprop and AdaDelta. One of the main hyperparameters is the learning rate, which varies greatly, from \(10^{-2}\) to \(10^{-4}\); two articles48, 57 consider a range of values. As shown in Table 2, the training proportion, which can be viewed as a hyperparameter, varies widely (from 20% to 95% of the dataset) and is generally selected according to the size of the dataset. These results highlight the variability in the choice of hyperparameters for data management and optimization.

Variability in the architecture of DL frameworks

In Table 2, we can see that the main strategy is to use a convolutional neural network (CNN) or recurrent neural network (RNN) architecture (91% of the methods) for segmentation (these architectures are types of DL models14). Two articles51, 58 test several different DL architectures in their frameworks (5 architectures in51 and 2 in58).

Only one article55 performed a grid search algorithm to determine the structure of the architecture (based on the kernel and max-pooling size for each layer and on the number of layers).

Variability in middleware and infrastructure

In Table 5, we can see that several implementations are considered. More precisely, four different toolboxes (Theano59, Mat-ConvNet60, Caffe61 and Pylearn262) are referenced in the articles. Only one in-house implementation was used52.

In 13% of all the articles, a high-level API (Keras63 or Lasagne64) is deployed in addition to these toolboxes.

All the articles describing the infrastructure ran their algorithms on a GPU. No article referred to a distributed system for the implementation of the DL algorithm.

Figure 3

Four different sources of variability. (A) There is large variability in the dataset size: in 68.5% of the datasets, the number of samples is less than or equal to 50. (B) In general, no cross-validation strategy is considered (more than 50% of the articles). (C) Five different optimization algorithms are introduced in the different articles. The main one is SGD with momentum (SGD(M)); the gradient-based method (GBM) and stochastic gradient descent (SGD) are only general terms. (D) There are 5 different implementations of DL frameworks. Even though the Theano implementation is used in 42.9% of the considered articles, there is no consensus among the implementations.

Evaluation of the variability

Almost half of the articles consider fewer than 3 metrics, which is the number recommended by Udupa et al.26 (see Table 3 and Fig. 4). In a quarter of the articles, no variability of the metrics (such as the standard deviation) is provided; in some cases, this can be explained by the use of a data challenge platform for the evaluation. In most articles, the variability is displayed with a boxplot. Only two articles report the complete results for each participant65, 66.

For the evaluation metrics, the DC is considered in all the articles. There is large variability in the other metrics, since 22 different metric names can be found. Some of them denote the same metric under different names, such as the true positive rate, recall and sensitivity.

Reproducibility in the literature reviews

To evaluate the impact of reproducibility in DL for image segmentation after 2017, we consider the 3 reviews37, 45, 46. All the reviews highlight the problem of correctly comparing different methods. To address this issue, the reviews suggest testing the DL frameworks on public datasets through challenges and providing the code publicly. Moreover, the study45 suggests that the difficulty of comparing the frameworks comes from the numerous available metrics used to evaluate segmentation. Furthermore, the study37 highlights the problem of reproducibility due to the lack of a correct description of the frameworks. Finally, all the reviews consider reproducibility as a challenge.

However, none of them raises the question of the intrinsic variability of DL frameworks. They do not refer to the use of multiple metrics to correctly evaluate segmentation, nor do they discuss cross-validation. For the results reported in these reviews37, 45, no variability measure, such as the standard deviation, is provided.

Figure 4

The number of evaluation measures used in each article. Note that at least 3 measures are recommended to correctly evaluate a segmentation result26.

Table 2 The training set size, the kind of data augmentation (DA), the presence of the DA term and of the validation set (VS) term, the training proportion and the cross-validation (CV) strategy in each article.
Table 3 The different DL models, the kinds of datasets (the number of datasets, denoted as Nb DS, and whether they are public or private) and the kind of evaluation (the type, number and variability of the measures).
Table 4 The kind of optimization, whether the hyperparameters (HPs) are handcrafted, the learning rate (the value (V.) and the presence (P.) of the term), the batch size (the value (V.) and the presence (P.) of the term), the dropout regularization (the value (V.) and the presence (P.) of the term) and whether the code is open source.
Table 5 In the second column, the different implementations are described (Theano \(^{1}\), Mat-ConvNet \(^{2}\), Caffe \(^{3}\), Keras \(^{4}\), Pylearn2 \(^{5}\) and Lasagne \(^{6}\)).

Proposals for practices conducive to reproducibility in medical image segmentation with DL

On the basis of the literature review, our recommendations focus on three main points: (1) an adequate description of the DL framework, (2) a suitable analysis of the different sources of variability in the DL framework, and (3) an efficient evaluation system for the segmentation results.

The flowchart of the different proposals is displayed in Fig. 5. Although each part is independent, there is a natural order that we follow in our recommendations.

Figure 5

The proposals are separated into three main parts: (A) an adequate and complete description of the DL framework for reproducibility purposes, (B) an analysis of the different sources of variability, and (C) an efficient evaluation system for image segmentation.

Recommendations for the description of the framework

First, to perform reproducible research, it is mandatory to correctly describe all aspects of the framework, from the DL model and its related hyperparameters to the evaluation system. The initial step consists in clearly describing the algorithm and/or the model of the DL architecture. A diagram of the DL architecture should be provided, since the architecture is generally complex.

For the data part, several steps are mandatory:

  • A complete description of the dataset is required, with the kind of acquisition (e.g., MRI or CT), the size of the images and the total sample size. If the dataset is publicly available, a download link should be provided.

  • For the preprocessing stage, the authors should explain whether some data are excluded. In the case of data augmentation, the different kinds of transformation must be described and the final number of samples should be included. For the special case of images, if the data augmentation consists in the selection of multiple patches, the characteristics and the final number of patches should be described.

  • The allocation of the dataset samples into training, validation, and testing sets should be clearly described. If no validation set is created, this must be clearly stated and the choice should be explained.

  • The cross-validation strategy should be described along with the number of folds considered.

For the optimization step, the chosen algorithm should be clearly referenced with its name and its corresponding publication, and the final hyperparameters, such as the learning rate or the batch size, should be provided. If several evaluations are performed, the number of trials should be given.

For the selection of the hyperparameters of the optimization process or the design of the DL architecture, the method should be explained, even if it is handcrafted. More precisely, the method and the search space of the different hyperparameters should be provided.

A description of the computing infrastructure should be given with technical specifications: at least the name of the manufacturer, the class of the architecture and the memory size should be provided. For the middleware, the kind of implementation should be described (an available toolbox or in-house code, and the build version). If the toolbox is public, a link to the toolbox should be provided. In general, the best solution is to provide a link to the downloadable source code with all dependencies included.

Finally, for the evaluation, a clear description of the results should be given with the average metrics and their variations. If a figure is displayed, such as a boxplot, the values of the error bars should be provided.

All these recommendations have been proposed in the Machine Learning Reproducibility Checklist18. There are two main differences between their recommendations and ours. First, we group the different points by source of variability, whereas they group them by article section (Methods and Results). The second difference is the particular focus on image segmentation proposed here.

Recommendations for the analysis of variability

As shown by the literature review, sources of variability occur in each part of a DL framework: the dataset, the optimization procedure, the selection of the hyperparameters, the DL architecture, and the computational infrastructure. Each kind of variability is different and should be considered with its own tools.

Variability of the dataset

In practice, the available data are always a subset (or sampling) of the true distribution. This sampling effect typically introduces a bias which, in turn, results in variability in the final results.

This fact is important in medical image processing, where the number of samples is limited, and a DL network learning from a particular sampling of the data can overfit. The recommended common tool is cross-validation; it must be noted that its purpose is different from the use of batches in optimization. The choice of the cross-validation method (i.e., the number of samples left out, as in the leave-one-out or k-fold strategies) should be made with regard to the sample size. The choice of the samples to be analyzed (patches or 2D slices extracted from 3D images) can lead to strong correlations between samples.

To obtain a better sampling of the true distribution from the training dataset, data augmentation must be considered. The different transformations for the augmentation must be chosen carefully with regard to the organs studied. Furthermore, data augmentation can enhance the accuracy by compensating for poor diversity of the training dataset with respect to the testing dataset.

Variability of the optimization

For the optimization, the analysis of variability has mostly been studied for classification purposes29,30,31. Our recommendation for managing the variability of the optimization is to perform at least several trials with the same hyperparameters on the same datasets. The observed variability should be reported as the average score and its corresponding standard deviation.

For a deeper analysis that also takes the dataset variability into account, the optimization should be repeated several times within a cross-validation strategy over the dataset. All the results for each fold of the dataset should be grouped, and a one-way ANOVA test should be performed on the different groups to test whether there is a difference or an interaction over the dataset with respect to the optimization. If the assumptions of ANOVA are violated, a strategy based on a nonparametric test over cross-validation has already been proposed74 to better estimate the residual error and to analyze the interaction between the algorithm and the learning dataset.
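A minimal sketch of this recommendation, on synthetic Dice scores, is given below; the Kruskal-Wallis test shown at the end is a generic nonparametric alternative, not the specific procedure of74.

```python
# Several training trials per cross-validation fold, reported as mean +/- std,
# followed by a one-way ANOVA across folds (synthetic Dice scores).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_folds, n_trials = 5, 4
# scores[i, j] = Dice of trial j (re-training with the same hyperparameters) on fold i
scores = rng.normal(loc=0.85, scale=0.02, size=(n_folds, n_trials))

for i, fold_scores in enumerate(scores):
    print(f"fold {i}: Dice = {fold_scores.mean():.3f} +/- {fold_scores.std(ddof=1):.3f}")

# One-way ANOVA: do the mean scores differ significantly between folds?
f_stat, p_value = stats.f_oneway(*scores)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")

# Generic nonparametric alternative if the ANOVA assumptions are violated.
h_stat, p_kw = stats.kruskal(*scores)
```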

Variability of the hyperparameters

Our recommendation for the selection of the hyperparameters is to avoid handcrafted selection. Even if this selection is fast, the set of hyperparameters obtained can have high variability, since the hyperparameters can lie within a wide range of values33. Automatic selection by grid search, random search or Bayesian optimization yields values that are closer to the optimum and potentially more robust. It should be noted that a Network Architecture Search Best Practices Checklist75 was written in September 2019 on this specific subject.

Variability of DL architectures

The main problem addressed by the evaluated articles is: how should different DL architectures be compared? The comparison should consider the variability in the dataset, the optimization and the hyperparameter selection. Our recommendation is to perform, for each evaluated DL architecture, several trials of optimization on each fold of the dataset provided by a cross-validation strategy. A two-way ANOVA can then be used to evaluate the variability of the metrics with respect to the different folds of the cross-validation and the different DL architectures. If the assumptions of the ANOVA are violated, a nonparametric test can be used instead76.
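A minimal sketch of such a two-way analysis with statsmodels, on synthetic Dice scores for two hypothetical architectures, is given below.

```python
# Two-way ANOVA (architecture x cross-validation fold) on synthetic Dice scores,
# with several optimization trials per (architecture, fold) pair.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
rows = []
for arch, base in [("unet", 0.86), ("fcn", 0.84)]:     # two hypothetical architectures
    for fold in range(5):                               # 5 cross-validation folds
        for trial in range(3):                          # 3 training trials per fold
            rows.append({"arch": arch, "fold": fold,
                         "dice": base + rng.normal(0.0, 0.02)})
df = pd.DataFrame(rows)

model = ols("dice ~ C(arch) + C(fold) + C(arch):C(fold)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # effects of architecture, fold and their interaction
```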

Variability of the infrastructure

In general, it is difficult to test the variability due to the infrastructure, since the cost can be high. Our recommendation is to precisely document the specifics of the infrastructure to avoid side effects on reproducibility. Two important factors are the number of processing units and their characteristics (the kind of computing unit used and the available random access memory (RAM)). The number of processing units deeply impacts the framework (distributed or nondistributed system). The RAM can limit the size of the batch during optimization. The kind of computing unit used can introduce quantization effects and problematic numerical noise into the optimization.

Regarding the middleware, an automatic deployment of the operating system and of the toolbox associated with the DL framework is recommended, based on a complete description of the system.

In addition, distributed systems can be considered to complete the simulations in a reasonable time. To mitigate the reproducibility problem, some recommendations concerning the network should be followed, such as the use of InfiniBand (to reduce latency) or of a compartmentalized network (to avoid interactions with other users).

Recommendations for the analysis of the evaluation system

In the context of image segmentation, at least three metrics should be considered19. Because some of them are correlated19, it is important to carefully choose which metric suits the scenario at hand26.

Even if several metrics are defined19 and, to the best of our knowledge, no consensus exists, we propose to evaluate segmentation methods with at least the following three common metrics: the DC, the TPR and the FNR. These metrics are described in the Methods section. Readers interested in image segmentation metrics can find more complex evaluations based on the recommendations of19, 26.

Discussion

The complexity and heterogeneity of DL frameworks are responsible for multiple kinds of variability. Because of the reproducibility crisis21, 22, researchers have highlighted multiple factors that induce variability in the results obtained, as well as important guidelines that must be respected in order to minimize—or at least quantify—these effects: (i) for other researchers to be able to replicate the obtained results, it is necessary to precisely describe the DL framework in use as well as its optimization procedure; (ii) potential sources of variability must be acknowledged and, when possible, evaluated in order to determine their importance. Last, it is crucial to consider the specifics of the field being researched: already-existing data processing methodologies and evaluation procedures must be properly incorporated within the DL framework—see for instance19, 26 for medical image segmentation. In practice, however, assessing reproducibility and variability is a rather difficult task in the context of DL frameworks.

  • Key factors are generally interdependent. For instance, the variability due to the optimization procedure depends not only on the choice of hyperparameters but also on the input data provided, i.e., the datasets. Facing such an issue, there is a need for new mathematical tools: (i) to decorrelate the overall variability and capture the individual effects associated with given parameter subsets; and (ii) to better compare the results obtained with different DL solutions.

  • Heterogeneous nature of the variability. This heterogeneity often makes it difficult to relate different sources of variability. For instance, consider the variability in the input data distribution on the one hand and the variability in the stochastic optimization process on the other: these cannot be addressed in the same way, which in turn means that different mathematical tools are needed to evaluate them.

  • Hardware/software perturbations. Typically, variability is estimated from a large number of repeated simulations, which requires powerful and/or distributed systems. These systems also induce variability, as they may differ slightly (in terms of architecture, data quantization, rounding strategies, implementation constraints, etc.).

Conversely, variability may also be seen as a blessing. For instance, merging different optimization solutions or different DL frameworks improves the segmentation47 and, more generally, the robustness.

Finally, there is no clear consensus on the meaning of reproducibility, robustness and generalizability77. The notion of reproducibility should be driven mainly by the kind of application.