Introduction

Medical imaging plays a central role in medicine today because it can reveal the anatomy of the patient. However, to leverage the full potential of medical images, it is necessary to analyze them via image processing. One of the main clinical tools is image segmentation1, 2. Medical image segmentation can be defined as an automatic (or semiautomatic) process to detect boundaries within a 2D or 3D image, based on information such as pixel intensity, texture and anatomical knowledge. The result of segmentation can then be used in further applications and in gaining insights2; examples include the quantification of tissue volumes3, 4, diagnosis5, 6, the localization of pathology7, 8, the study of anatomical structure9, 10, treatment planning11, and computer-integrated surgery12.

Manual medical image segmentation raises two main issues: the time needed for delineation and the reproducibility of the results. First, the time needed to segment cannot be reduced below a certain point, and it grows with the number and the size of the images. Since both are increasing due to easier access to medical imaging facilities and improvements in acquisition technologies, manual segmentation is becoming intractable. Second, reproducibility corresponds to the agreement between the results of multiple measurements of the data (here, the segmentation results) under the same methodology. In medical image segmentation, it is well known that there is inter- and intraoperator variability. The former relates to the observed differences in the segmentation results obtained by two different operators, while the latter relates to the observed differences between two segmentations performed by the same operator at two different times. Given the crucial role of segmentation in medical diagnosis and treatment, the reproducibility of the method is fundamentally important.

These two issues lead one to consider automatic segmentation. Automatic segmentation consists in determining a prediction model and its inherent parameters relative to a given class of problems (for example, the kind of imaging performed or the organs imaged). These parameters can be divided into two classes: the hyperparameters associated with the model and the parameters estimated from the dataset. The aim of automatic segmentation is to estimate the parameters that yield highly accurate results on the training dataset while maintaining good generalization to other datasets of the same class of problem, also called “test datasets”. In other words, the algorithm must avoid fitting the training set perfectly while performing poorly on the test set. This problematic phenomenon is called “overfitting” (see page 108 of Goodfellow et al.13).

The rapid development of new automatic segmentation algorithms since the 2000s is strongly connected to the rise of machine learning2. During the last decade, a specific field of machine learning and artificial neural networks, called “deep learning” (DL)13, 14, has outperformed classical segmentation methods15. A neural network with several hidden layers is considered a ‘deep’ neural network, hence the term ‘deep learning’14. Several factors explain this success, for example, unsupervised feature extraction via convolutional layers and the ability to handle very large datasets via efficient optimization methods such as backpropagation of the gradient (see chapter 6.5 of the book by Goodfellow et al.13). Several DL architectures have been applied to medical image segmentation, including fully convolutional networks (FCNs)16 and U-Net17 (see Litjens et al.15 for a recent review). FCNs16 are built from locally connected layers, such as convolution, pooling and upsampling layers. An FCN is composed of two main parts: a downsampling path and an upsampling path. The downsampling path captures contextual information, whereas the upsampling path recovers spatial information. Moreover, skip connections between layers are used to recover fine-grained spatial information that is potentially lost in the pooling and downsampling layers. U-Net17 is built upon FCNs. The main difference is that each downsampling scale is linked to the corresponding upsampling scale with a concatenation operator. In this way, each upsampling scale receives the information of the corresponding downsampling scale and of the lower upsampling scale, leading to better segmentation.
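For illustration, the following minimal sketch (in PyTorch; a hypothetical toy network, not the exact architecture of Ronneberger et al.17) shows the downsampling path, the upsampling path and the concatenation skip connection that characterize U-Net-like models.

```python
# Minimal two-scale U-Net-style sketch (hypothetical, for illustration only).
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_channels=1, n_classes=2):
        super().__init__()
        self.down = nn.Sequential(                      # downsampling path: context
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
        )
        self.pool = nn.MaxPool2d(2)
        self.bottom = nn.Sequential(
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # upsampling path: localization
        self.fuse = nn.Sequential(                         # applied after the skip concatenation
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, n_classes, 1),
        )

    def forward(self, x):
        skip = self.down(x)               # fine-grained features kept for the skip connection
        x = self.bottom(self.pool(skip))  # coarse, contextual features
        x = self.up(x)                    # recover spatial resolution
        x = torch.cat([skip, x], dim=1)   # U-Net-style concatenation skip connection
        return self.fuse(x)

# Example: a batch of four 64x64 single-channel images -> per-pixel class scores.
logits = TinyUNet()(torch.randn(4, 1, 64, 64))
print(logits.shape)  # torch.Size([4, 2, 64, 64])
```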

However, although DL algorithms perform well, they are complex. A number of factors may explain the variability in the results obtained: the intrinsic variability of the dataset, the stochastic processes involved in optimization, the choice of the hyperparameters governing the optimization and regularization processes, and the choice of the DL architecture itself. This variability in the different parts of the framework makes it difficult to analyze reproducibility and to compare frameworks. In addition, it leads to numerous parameters and hyperparameters that must be set. Furthermore, as highlighted in Joelle Pineau’s reproducibility checklist18, presented during NeurIPS 2019, describing the DL methods becomes a challenge for reproducibility in its own right. Moreover, the strategy for evaluating the segmentation results, and thereby the variability of the method, is complex: there is a plethora of metrics19 for analyzing segmentations, leading to various ways of comparing the methods.

Along these lines, three main questions, at least, about variability and reproducibility can be formulated.

  • Question 1: Is there enough information in published articles in the field of medical image segmentation with DL to correctly reproduce the results?

  • Question 2: If the information is provided, has the variability in the several steps of the DL framework been considered?

  • Question 3: Does the evaluation system for the segmentation results correctly reflect this variability?

These three questions are crucial for the application and potentially the evaluation of the segmentation algorithms. After focusing on the concept of reproducibility in medical image segmentation and on how to consider the different sources of variability in DL, we will review the literature to provide an overview of the practice of reproducibility in the fields of medical image segmentation in DL, based on three main topics: (1) the description of the methods, (2) the analysis of variability and (3) the evaluation system. On the basis of this synthesis, we will propose recommendations to appreciate the results of new DL strategies.

Related work

In this section, we will broadly address the issues of the reproducibility and evaluation of segmentation in medical imaging. Then, we will outline several sources of variability in the DL framework that can lead to difficulties for reproducibility.

Reproducibility and evaluation of segmentation in medical imaging

Reproducibility is a prominent topic in science20. Numerous articles21, 22 point to a potential reproducibility crisis across scientific fields: most scientists have experienced a failure to reproduce results21 (more than 50% in the case of their own work in medicine, physics and engineering, and more than 75% in the case of work by another person in the same fields).

In the rest of the article, we follow the definition of the report of the National Academies of Sciences, Engineering, and Medicine20: reproducibility means obtaining consistent results using the same input data, computational steps, methods, and conditions of analysis; it is synonymous with computational reproducibility. Moreover, this report20 (recommendation 5-1, page 7) recommends that researchers provide an accurate and appropriate characterization of relevant uncertainties when they report or publish their research. These uncertainties include stochastic uncertainties.

Reproducibility can be assessed with different procedures. First, it can be analyzed with the intraclass correlation (ICC) proposed by Shrout and Fleiss23. The resulting score lies between 0 and 1, indicating poor and perfect reproducibility, respectively, and enables a comparison between intra-individual and inter-individual variability. Another statistical tool, which generalizes the ICC, is analysis of variance (ANOVA)24. It provides a collection of tools focusing on the variability of the means among groups. One interesting point is that ANOVA can deal with multiple factors.
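As an illustration, the following sketch (assuming hypothetical Dice scores for n subjects rated by k operators) computes the one-way random-effects ICC(1,1) of Shrout and Fleiss and a one-way ANOVA with SciPy.

```python
# ICC(1,1) from a one-way random-effects model, plus a one-way ANOVA with SciPy,
# on hypothetical Dice scores (n subjects rated by k operators).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 10, 3                                          # 10 subjects, 3 operators (hypothetical)
truth = rng.normal(0.85, 0.05, size=(n, 1))           # subject-level "true" score
scores = truth + rng.normal(0.0, 0.02, size=(n, k))   # operator noise added on top

grand = scores.mean()
msb = k * ((scores.mean(axis=1) - grand) ** 2).sum() / (n - 1)                    # between subjects
msw = ((scores - scores.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))  # within subjects
icc_1_1 = (msb - msw) / (msb + (k - 1) * msw)
print(f"ICC(1,1) = {icc_1_1:.3f}")                    # close to 1 -> good reproducibility

# One-way ANOVA testing whether the operator (column) means differ.
f_stat, p_value = stats.f_oneway(*[scores[:, j] for j in range(k)])
print(f"ANOVA across operators: F = {f_stat:.2f}, p = {p_value:.3f}")
```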

One of the main sources of variability in machine learning originates from the difference between the observed samples of the dataset and the true distribution of the data. The fact that the learning step of the algorithm is performed on only part of the distribution can affect the reproducibility and particularly the replicability of the results. A class of tools called “cross-validation” (CV)25 is available for studying this variability. These methods receive special attention in the next section, concerning variability in the dataset.

Moreover, segmentation in the specific field of medical imaging is complex in terms of reproducibility for several reasons. First, the available datasets are generally limited: the number of samples is typically fewer than 100. Then, each segmentation task must be considered with regard to the image modality (for example, MRI, CT scanner or ultrasound) and the organ studied26. Furthermore, the masks used in segmentation are usually generated manually, which introduces intra- and inter-rater variability. Consequently, there is no absolute ground truth but only a gold standard. Additionally, there are several metrics for evaluating segmentation, such as the Dice coefficient (DC) and the modified Hausdorff distance, and each metric focuses on a specific aspect of the segmentation19. For example, a metric can correctly reflect the overlap between a segmentation mask and a gold standard, but it cannot highlight the smoothness of the contour. To correctly describe the quality of a segmentation, several metrics are necessary19, 26. An adequate evaluation system allows the variability of DL frameworks to be properly taken into account.

Variability in DL frameworks

In the next sections, five different kinds of variability are presented. The DL framework and its related sources of variability are displayed in Fig. 1.

Figure 1

The different steps of a DL framework are displayed in solid-line boxes: the steps related to the dataset (the data augmentation and cross-validation strategies), the DL architecture design, the training step (with the optimization procedure), the estimation of the hyperparameters of the optimization, and the evaluation system. The different sources of variability are highlighted in dashed-line boxes: the variability linked to (A) the dataset, (B) the DL architecture, (C) the optimization procedure, (D) the hyperparameter estimation for the optimization and (E) the implementation and infrastructure.

Variability in the dataset

To infer a segmentation with a DL model (and, more generally, a supervised machine learning model), the classic method consists in splitting the dataset into three parts. The first part is the “training set”, used to estimate the parameters of the model: it is composed of the raw data and corresponding labels. Based on the raw data, the DL algorithm infers results that are compared to the labels, and the DL parameters are then optimized to minimize the error between the results and the labels. The second part is the “validation set”, which is more specific to the DL community. It estimates the unbiased error of the trained DL model and allows training to be stopped early to avoid overfitting. It is not mandatory and is used in practice when the dataset has enough samples. Finally, the last part, called the “testing dataset”, provides an unbiased evaluation of the final DL model. The proportions of the different parts depend on the initial number of samples and can significantly affect the expected degree of generalization. Consider a trivial example where only one sample is chosen for the testing set: the evaluation of the DL model then depends greatly on the selected sample. In the same way, selecting few samples for the training set leads the model to learn the training data perfectly without generalizing.

To avoid bias in the data selection, strategies called “cross-validation” are performed. These strategies consist in dividing the dataset into several folds and assigning these folds to the training, validation and testing sets. At the end of the DL model estimation and evaluation processes, the folds are reassigned for new estimations, and so on. Cross-validation strategies make it possible to address variability in the data, as sketched below.
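A minimal sketch of such a split, using scikit-learn and a hypothetical dataset of 100 images, is given below; the 5 folds and the 20% validation subset are arbitrary choices.

```python
# k-fold cross-validation split with a validation subset inside each training fold.
import numpy as np
from sklearn.model_selection import KFold, train_test_split

images = np.random.rand(100, 64, 64)                # 100 hypothetical 2D images
masks = np.random.randint(0, 2, size=(100, 64, 64)) # corresponding hypothetical masks

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(images)):
    # Keep a validation subset inside the training fold to monitor overfitting.
    train_idx, val_idx = train_test_split(train_idx, test_size=0.2, random_state=0)
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val / {len(test_idx)} test")
    # model = build_model(); model.fit(images[train_idx], masks[train_idx], ...)
```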

The number of parameters to estimate in a DL model is often larger than the number of images in the dataset. Moreover, in medical image segmentation, the heterogeneous appearance of the target organ (anatomical variability) or of the lesions (size, shape or position) poses a great challenge. One solution, called “data augmentation”27, generates new samples by applying different transformations to the dataset (e.g., rotation or flipping). In this way, unseen target organs or lesions can potentially be approximated. However, this also adds a source of variability to the general framework, since there is no consensus on which transformations to perform, and the parameters of the transformations are generally chosen randomly.
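The sketch below illustrates simple geometric data augmentation on a hypothetical 2D image/mask pair; the chosen transformations (flips and 90-degree rotations) are only examples, and they must be applied identically to the image and to its mask.

```python
# Simple geometric data augmentation for a 2D image/mask pair (illustrative choices).
import numpy as np

def augment(image: np.ndarray, mask: np.ndarray, rng: np.random.Generator):
    if rng.random() < 0.5:                        # random horizontal flip
        image, mask = np.fliplr(image), np.fliplr(mask)
    if rng.random() < 0.5:                        # random vertical flip
        image, mask = np.flipud(image), np.flipud(mask)
    k = rng.integers(0, 4)                        # random 90-degree rotation
    return np.rot90(image, k), np.rot90(mask, k)

rng = np.random.default_rng(42)
image = np.random.rand(64, 64)                    # hypothetical image
mask = (image > 0.5).astype(np.uint8)             # hypothetical mask
aug_image, aug_mask = augment(image, mask, rng)
```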

Variability in the optimization

This section focuses specifically on the variability of the optimization for an already estimated and fixed set of hyperparameters. One of the main sources of complexity is the very large number of model parameters to be estimated: solving the optimization problem of estimating these weights is generally an extremely difficult task that relies on a stochastic process.

Each weight in the DL algorithm corresponds to a parameter (which can be seen as a dimension) of the cost function of the optimization. DL models often have millions of parameters, making the search space to be explored by the algorithm extremely high-dimensional, in contrast to classic machine learning algorithms. Moreover, each added dimension dramatically increases the distance between points in this high-dimensional space, so the search space grows drastically. More precisely, the number of possible distinct configurations of a set of parameters increases exponentially with the number of parameters. This is often referred to as the “curse of dimensionality” (see page 155 of Goodfellow et al.13).

In addition, the cost function is generally nonconvex (see page 282 of Goodfellow et al.13). These facts lead to several issues: the presence of local minima and flat regions, compounded by the high dimensionality of the search space. The best general algorithm known for solving this problem is stochastic gradient descent (SGD) (see chapter 5.9 of Goodfellow et al.13), where the model weights are updated at each iteration using the backpropagation-of-error algorithm. However, there is no guarantee that the DL estimation will converge to a good solution (or even a good local optimum), that the convergence will be fast, or that convergence will occur at all28.
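The following toy sketch illustrates the SGD update rule with momentum on a hypothetical least-squares problem; in a DL framework the gradient would instead be computed by backpropagation over a nonconvex cost, and the learning rate, momentum and batch size shown here are arbitrary.

```python
# SGD with momentum on a toy least-squares problem (illustrative hyperparameters).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=256)

w = np.zeros(5)
velocity = np.zeros(5)
lr, momentum, batch_size = 0.1, 0.9, 32

for step in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # random mini-batch
    grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch_size  # gradient of the MSE loss
    velocity = momentum * velocity - lr * grad                # momentum update
    w = w + velocity                                          # parameter update

print(np.linalg.norm(w - w_true))  # the residual should be small on this toy problem
```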

Nevertheless, recent work suggests that local minima and flat regions may be less challenging than previously believed29,30,31. According to Choromanska et al.29, almost all local minima have function values very similar to that of the global optimum, so finding a local minimum is sufficient. These results were obtained for classification tasks. Furthermore, the convolutional step, which is important for segmentation, is not considered in Choromanska et al.29 or Dauphin et al.30.

To the best of our knowledge, only one conference article32 addresses this issue of stochastic optimization uncertainties in medical image segmentation with DL. The authors show that DL models estimated several times with the same data differ, but the results obtained on the evaluated metrics are not significantly different.

Variability in the hyperparameters

The hyperparameters correspond to the global settings of an algorithm; in machine learning, each of them impacts the results differently33. Several hyperparameters must be set before training the DL model, for example, the learning rate for optimization and the dropout percentage for regularization13.

There are different ways to set them. The first is manual configuration. This strategy limits the exploration space, but the computation time is relatively short compared to those of other methods, since only a rough approximation of the best hyperparameters is expected. The second kind of strategy is based on automatic exploration of the search space. The classic method, called “grid search”, tests every combination of hyperparameters. It will find the best set of hyperparameters within the grid, but the computational cost increases greatly with the number of hyperparameters. Another strategy, called “random search”, randomly samples the sets of hyperparameters to be evaluated. This method generally cannot reach the optimal values, but it approximates them in fewer iterations than grid search.

A more recent strategy34, called “Bayesian optimization”, automatically infers a new combination of hyperparameters from the previous evaluations. In this case, the space exploration is intermediate and driven by experience, and its cost is lower than that of a grid or random search.
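The sketch below contrasts grid search and random search over two hyperparameters; the `evaluate` function is a hypothetical stand-in for training a model and returning a validation score.

```python
# Grid search versus random search over two hyperparameters (synthetic objective).
import itertools
import random

def evaluate(learning_rate, dropout):
    # Stand-in for "train the model, return validation Dice"; purely synthetic.
    return 1.0 - abs(learning_rate - 1e-3) * 100 - abs(dropout - 0.3)

# Grid search: every combination of the predefined values is evaluated.
lr_grid = [1e-2, 1e-3, 1e-4]
dropout_grid = [0.2, 0.4, 0.6]
best_grid = max(itertools.product(lr_grid, dropout_grid),
                key=lambda hp: evaluate(*hp))

# Random search: the same budget (9 trials) spent on randomly sampled values.
random.seed(0)
candidates = [(10 ** random.uniform(-4, -2), random.uniform(0.1, 0.7))
              for _ in range(9)]
best_random = max(candidates, key=lambda hp: evaluate(*hp))

print("grid search best  :", best_grid)
print("random search best:", best_random)
```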

Variability in the DL architecture

Here, only the number of nodes, the number of layers, the kinds of layers (for example, convolutional, pooling or dense) and the connections among the layers are considered as defining the architecture. Even with only these four factors, the number of possible architectures is infinite.

In practice, three main strategies are used to select the architecture. The first consists in selecting a well-known DL model that has already proved its performance in previous work15, such as U-Net17 for image segmentation. This strategy is most common in clinical application fields, but it is not expected to provide the best architecture for a specific problem.

Another strategy consists in manually handcrafting the DL architecture, which has led to a plethora of architectures15. However, it does not guarantee the best architecture, and modifications of the tested architecture are generally not explored. The final strategy, called “network architecture search”, is to automatically create a DL architecture through optimization for a specific task35. The drawback of approximating the best architecture in this way is a very high cost in time and resources: for instance, the network architecture search proposed in36 tested 20,000 architectures in 4 days with 500 graphics processing units (GPUs).

The estimation of the minimal network architecture needed to achieve a certain segmentation accuracy on a given dataset can enable variability in the DL architecture to be avoided. However, as discussed in the review37, this topic remains a challenge.

Variability in the middleware and the infrastructure

The previous sections focused on algorithmic sources of variability in DL. In this section, the possible variability due to the middleware and the infrastructure is considered. There are many toolboxes for implementing a DL framework. To the best of our knowledge, no publication has addressed the problem of reproducibility in DL with regard to the middleware. Implementations differ, for example, in their programming language and in their capacity to use a GPU. A review of different implementations and their characteristics can be found in38.

The learning phase in DL can be a very long process, given the complexity of the DL architecture and the dataset size. As previously explained, the search for hyperparameters can also be prohibitive. To improve the processing time, several solutions based on the infrastructure are considered. Different kinds of infrastructure39 can be used, such as a central processing unit (CPU), a GPU or a tensor processing unit (TPU). However, some technical characteristics, such as the memory precision for different memory sizes, can affect the accuracy of the results40. As another example, numerical operations performed on the GPU can be nondeterministic, leading to nonreproducible results41.
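As an illustration, the following sketch gathers the seeding and determinism settings commonly used with PyTorch to limit such nondeterminism (the flag names assume a recent PyTorch version); even then, bitwise reproducibility across hardware or library versions is not guaranteed.

```python
# Seeding and determinism settings to limit GPU nondeterminism with PyTorch.
import os
import random
import numpy as np
import torch

def set_deterministic(seed: int = 0) -> None:
    random.seed(seed)                          # Python RNG
    np.random.seed(seed)                       # NumPy RNG
    torch.manual_seed(seed)                    # CPU and CUDA RNGs
    torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable nondeterministic autotuning
    # Some CUDA operations additionally require this workspace setting.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)   # raise an error on nondeterministic ops

set_deterministic(42)
```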

Another possibility for accelerating the processing time is to choose a parallel or distributed DL model. These techniques come with their own methods that potentially impact the reproducibility of the outcome. For an overview of parallel and distributed models and their challenges, the interested reader can refer to42, 43.

Methods

In this section, we first introduce how the literature review was performed, and then, we briefly describe the different metrics.

Literature review

There is no standard for the reproducibility or evaluation of DL in medical image segmentation. The aim of this review is to reflect common practices for DL in medical image segmentation. To fulfill this expectation, this review focuses on three goals: (1) to inspect how the methods are described to enable work to be reproduced, (2) to present the variety of methodology and highlight the variability among DL frameworks and (3) to outline the kinds of evaluations used in DL.

To observe the variability of the methodology and evaluations in the literature, we focus on the 23 articles presented in the specific section “Tissue/anatomy/lesion/tumor segmentation” of the review article15. This review was chosen because it was the most relevant result on Google Scholar (with the mandatory keywords ‘medical image segmentation neural network’ and at least one of the keywords ‘review’ or ‘survey’) among more than 2300 hits (in December 2019). All the considered articles propose recent strategies: the oldest was published in 201444, and the mean year of publication is 2016. Moreover, the mean number of citations on Google Scholar (in December 2019) is \(232.3 \pm 308.2\) (median = 97, min = 20, max = 1074).

To obtain a more recent overview, we select 3 reviews of medical image segmentation methods37, 45, 46. We focus specifically on how the problem of variability and reproducibility is addressed in the scientific literature.

We focus on the possible variability introduced by the data itself, by the optimization strategy and its associated hyperparameters, by the middleware and the infrastructure, and by the evaluation measures. For each inspected parameter or evaluation, we determine whether the corresponding term is present and, if so, what value is reported. This information is essential for reproducing the different works. When a framework is described, we also determine whether the terms are used appropriately; to illustrate this point, we consider the kind of algorithm used in the optimization strategy.

For the data variability, we consider whether the DL algorithm is tested on several datasets, whether they are public or private, the number of data points available, whether data augmentation has been performed, the proportions of the training, validation and testing sets, and the possible application of a cross-validation method. For the optimization, we examine whether different parameters are recorded (the optimization strategy, learning rate, batch size, and presence of dropout regularization). We also investigate whether the hyperparameters of the optimization are handcrafted or automatically optimized (and whether this information is available). For the middleware and infrastructure considerations, we report whether these details are provided. Special attention is also paid to the implementation of the DL model and the processing unit considered. We also determine whether the calculations are performed on a distributed system, which can itself be a large source of variability. For the evaluation, we consider the number and kinds of measures, and whether the variability of the results is described (the presence of standard deviations).

Metric evaluation

The different estimations of the DL models are evaluated with the DC, the true positive rate (TPR), also called the sensitivity (Sens.), the true negative rate (TNR), also called the specificity (Spef.), and the average volume distance (AVD), which is linked to the Hausdorff distance. We chose these metrics because they appear frequently in the articles of the literature review. The different metrics are described in Table 1 (following Taha et al.19). We consider various metrics, since each metric has its drawbacks and evaluates only a part of the segmentation problem19, 26. Readers interested in additional metrics and the interactions among them can read the study of Taha et al.19.

Table 1 Segmentation metrics
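For reference, a minimal NumPy implementation of three of these metrics (DC, TPR/sensitivity and TNR/specificity) for binary masks could look as follows; the two masks below are hypothetical.

```python
# Overlap metrics for binary segmentation masks (DC, TPR, TNR).
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    tp = np.logical_and(pred, gt).sum()
    return 2.0 * tp / (pred.sum() + gt.sum())

def sensitivity(pred: np.ndarray, gt: np.ndarray) -> float:   # TPR
    tp = np.logical_and(pred, gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return tp / (tp + fn)

def specificity(pred: np.ndarray, gt: np.ndarray) -> float:   # TNR
    tn = np.logical_and(~pred, ~gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    return tn / (tn + fp)

# Hypothetical gold standard and predicted mask.
gt = np.zeros((64, 64), dtype=bool)
gt[16:48, 16:48] = True
pred = np.zeros((64, 64), dtype=bool)
pred[20:52, 16:48] = True
print(dice(pred, gt), sensitivity(pred, gt), specificity(pred, gt))
```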

Synthesis of the literature review

The main results are displayed in Tables 2, 3, 4 and 5. Table 2 focuses on the data variability. Table 3 focuses on the evaluation procedure. Table 4 presents the optimization strategies. Table 5 considers the middleware and the infrastructure. We are interested in the following three main points: (1) whether the DL strategy is correctly described as enabling the work to be reproduced, (2) whether the variability of the different parts of the DL framework are considered, and (3) how the evaluation is performed and the results are reported.

Description of the DL strategy

In this section, we focus not on whether some methods are performed, but on whether the methods are clearly described; a method may well have been applied without any mention in the text.

The main findings are as follows: only two articles47, 48 (9% of the articles) describe the hyperparameters and the dataset in sufficient detail to enable the work to be reproduced. One study49 has just one hyperparameter missing (the batch size) in the text, but the source code is available with this information included. Here, we focus on descriptions relative to the dataset and to the optimization stage. These results are detailed in Fig. 2. The left side of the figure pertains to the description of the dataset (the training proportion, the data augmentation and the validation set) and the right side to the description of the optimization (the optimization procedure, the learning rate, the dropout procedure and the batch size). Some criteria are described well, such as the training proportion (83% of the considered articles) or the optimization procedure (83% of the selected articles). However, some characteristics are less often reported, such as the data augmentation procedure (only 35% of the articles). To obtain a reproducible study, all these characteristics must be described. Only 9% of the selected articles provide sufficient information to be reproducible.

In Table 2, the dataset management is described. All the selected articles correctly present the dataset and the number of samples, but 17% do not state the training proportion used to estimate the parameters. Only 57% of the selected articles clearly state whether they used a validation set, and only 35% whether they performed data augmentation.

Table 4 focuses on the hyperparameters of the optimization process. 17% of the articles do not describe the optimization procedure at all, and one44 cites only a generic name (GDM, for gradient descent method) without any explanation. The learning rate is generally reported with its initial value (or range of values), but four articles do not mention the values of this parameter. For the specific AdaDelta optimization used in50,51,52, there is no learning rate; however, some coefficients need to be specified, such as the sensitivity ratio, and only one article52 of the three mentions this coefficient. More than half of the selected studies (52%) do not mention the batch size, and only 35% of all the articles specify its value. The dropout method, which relates to regularization, is present in 61% of the selected articles (only 43% specify the dropout ratio). 43% of the selected articles state that they perform stochastic gradient descent (SGD). However, in a strict sense, SGD is a generic term, and 90% of the selected articles use SGD with momentum. Moreover, SGD is often confused with mini-batch gradient descent methods53, which is the case for 70% of the selected articles, which use the term batch size alongside the term SGD.

In Table 5, it can be seen that 35% of the selected articles do not describe the toolbox used to implement the DL models, and 26% do not specify the kind of infrastructure. Assuming that a correct description of a GPU requires at least the name of the manufacturer, the class and the memory size, only 30% provide this information. It can also be observed in Table 5 that there is no convention for reporting the infrastructure.

Figure 2

The left side of the figure pertains to the description of the dataset and the right side to the description of the optimization. The training proportion is described in 83% of the selected articles; data augmentation and the validation set are described in 35% and 57% of the selected articles, respectively. For the optimization procedure, the name of the optimization algorithm is missing in 17% of the selected articles. The values of the learning rate, dropout and batch size hyperparameters are available in only 57%, 52% and 35% of the articles, respectively; these coefficients are mentioned in the text without any value in 19%, 9% and 13% of the articles, respectively. In the end, only 9% of the evaluated articles provide enough information to be reproducible.

The best way to reproduce an algorithm and to explore the hyperparameters or the architecture of a DL model is to have access to the source code. In Table 5, we observe that only 17% of the articles release the source code. These articles47,48,49 are the same as those that provide an exhaustive description of the framework for reproducibility.

Variability in DL frameworks

In the selected articles, we are interested in the variability in the dataset, the optimization, the hyperparameters, the architecture of the DL framework, the implementation and the infrastructure. The main results of the next sections are illustrated in Fig. 3, which is divided into four parts describing the variability of the dataset size, the cross-validation strategies, the optimization algorithms and the implementation. The main conclusion is that there is no consensus on these topics. The rest of the results are discussed in detail in the subsequent sections.

Variability in the dataset

In Table 3, the results focus on data variability. More than half of the methods are evaluated on more than one dataset and on publicly available datasets (in general, provided by data challenges such as BRATS54).

30% of the articles test their algorithms only on private datasets.

Only 6 datasets have more than 100 samples, and of these 6, 4 come from the same public source, BRATS. This highlights the difficulty of obtaining large datasets; consequently, data augmentation is important for medical image segmentation. Since the segmentation of a voxel can be performed locally, data augmentation based on patches can be considered. However, 13% of the articles do not clearly describe whether data augmentation is performed, whether a patch strategy is considered, or how many patches are selected. The training proportion and the CV strategies make it possible to avoid or limit bias relative to the chosen dataset, yet 52% of the articles do not use any CV strategy.

Variability in the optimization

One article47 presents an original strategy for managing the intrinsic variability of the optimization stage of the DL: the results of 3 DL models are merged, leading to better results than any single model alone. The other 22 articles do not discuss this notion.

Variability in the hyperparameters

We can observe in Table 4 that only one article55 clearly explains the tuning of the hyperparameters, using a grid search strategy. Another article56 claims to automatically tune the hyperparameters without any explanation. In the articles considered in Table 4, there are three main strategies: SGD with momentum, RMSprop and AdaDelta. One of the main hyperparameters is the learning rate, which varies greatly, from \(10^{-2}\) to \(10^{-4}\); two articles48, 57 consider a range of values. As shown in Table 2, the training proportion, which can be viewed as a hyperparameter, varies widely (from 20% to 95% of the dataset) and is generally selected according to the size of the dataset. These results highlight the variability in the choice of hyperparameters for data management and optimization.

Variability in the architecture of DL frameworks

In Table 2, we can see that the main strategy is to use a convolutional neural network (CNN) or recurrent neural network (RNN) architecture (91% of the methods) for segmentation (these architectures are types of DL models14). Two articles51, 58 test several different DL architectures in their frameworks (5 architectures in51 and 2 in58).

Only one article55 performed a grid search algorithm to determine the structure of the architecture (based on the kernel and max-pooling size for each layer and on the number of layers).

Variability in middleware and infrastructure

In Table 5, we can see that several implementations are considered. More precisely, four different toolboxes (Theano59, Mat-ConvNet60, Caffe61 and Pylearn262) are referenced in the articles. Only one in-house implementation was used52.

In 13% of all the articles, a high-level API (Keras63 or Lasagne64) is deployed in addition to these toolboxes.

All the articles describing the infrastructure ran their algorithms on a GPU. No article referred to a distributed system for the implementation of the DL algorithm.

Figure 3

Four different sources of variability. (A) There is large variability in the dataset size: in 68.5% of the datasets, the number of samples is less than or equal to 50. (B) In general, no cross-validation strategy is considered (more than 50% of the articles). (C) Five different optimization algorithms are introduced in the different articles. The main one is SGD with momentum (SGD(M)); the gradient-based method (GBM) and stochastic gradient descent (SGD) are only general terms. (D) There are 5 different implementations of DL frameworks. Even though the Theano implementation is used in 42.9% of the considered articles, there is no consensus among the implementations.

Evaluation of the variability

Almost half of the articles consider fewer than 3 metrics, which is the number recommended by Udupa et al.26 (see Table 3 and Fig. 4). In a quarter of the articles, no variability of the metrics (such as the standard deviation) is provided; in some cases, this can be explained by the use of a data challenge platform for the evaluation. In most articles, the variability is displayed with a boxplot. Only two articles report the complete results for each participant65, 66.

For the evaluation metrics, the DC is considered in all the articles. There is large variability in the other metrics, since 22 different metric names can be found. Some of them denote the same metric under different names, such as the true positive rate, recall and sensitivity.

Reproducibility in the literature reviews

To evaluate the impact of reproducibility in DL for image segmentation after 2017, we consider the 3 reviews37, 45, 46. All the reviews highlight the problem of correctly comparing different methods. To address this issue, the reviews suggest testing the DL frameworks on public datasets through challenges and providing the code publicly. Moreover, the study45 suggests that the difficulty of comparing the frameworks comes from the numerous available metrics used to evaluate segmentation. Furthermore, the study37 highlights the problem of reproducibility due to the lack of a correct description of the frameworks. Finally, all the reviews consider reproducibility as a challenge.

However, none of them raises the question of the intrinsic variability of DL frameworks. They do not refer to the use of multiple metrics to correctly evaluate segmentation, nor do they discuss cross-validation. For the results reported in these reviews37, 45, no variability measure, such as the standard deviation, is provided.

Figure 4

The number of evaluation measures used in each article. Note that at least 3 measures are recommended to correctly evaluate a segmentation result26.

Table 2 The training set size, the kind of data augmentation (DA), the presence of the DA term and of the validation set (VS) term, the training proportion and the cross-validation (CV) strategy in each article.
Table 3 The different DL models, the kinds of datasets (the number of datasets, denoted as Nb DS, and whether they are public or private) and the kind of evaluation (the type, number and variability of the measures).
Table 4 The kind of optimization, whether the hyperparameters (HPs) are handcrafted, the learning rate (the value (V.) and the presence (P.) of the term), the batch size (the value (V.) and the presence (P.) of the term), the dropout regularization (the value (V.) and the presence (P.) of the term) and whether the code is open source.
Table 5 In the second column, the different implementations are described (Theano \(^{1}\), Mat-ConvNet \(^{2}\), Caffe \(^{3}\), Keras \(^{4}\), Pylearn2 \(^{5}\) and Lasagne \(^{6}\)).

Proposals for practices conducive to reproducibility in medical image segmentation with DL

On the basis of the literature review, our recommendations focus on three main points: (1) an adequate description of the DL framework, (2) a suitable analysis of the different sources of variability in the DL framework, and (3) an efficient evaluation system for the segmentation results.

The flowchart of the different proposals is displayed in Fig. 5. Although each part is independent, there is a natural order that we follow in our recommendations.

Figure 5

The proposals are separated into three main parts: (A) an adequate and complete description of the DL framework for reproducibility purposes, (B) an analysis of the different sources of variability, and (C) an efficient evaluation system for image segmentation.

Recommendations for the description of the framework

First, to perform reproducible research, it is mandatory to correctly describe all aspects of the framework, from the DL model and its related hyperparameters to the evaluation system. The initial step consists in clearly describing the algorithm and/or the model of the DL architecture. A diagram of the DL architecture should be provided, since the architecture is generally complex.

For the data part, several steps are mandatory:

  • A complete description of the dataset is required, with the kind of acquisition (e.g., MRI or CT), the size of the images and the total sample size. If the dataset is publicly available, a download link should be provided.

  • For the preprocessing stage, the authors should explain whether some data are excluded. In the case of data augmentation, the different kinds of transformation must be described and the final number of samples should be included. For the special case of images, if the data augmentation consists in the selection of multiple patches, the characteristics and the final number of patches should be described.

  • The allocation of the dataset samples into training, validation, and testing sets should be clearly described. If no validation set is created, this must be clearly stated and the choice should be explained.

  • The cross-validation strategy should be described along with the number of folds considered.

For the optimization step, the chosen algorithm should be clearly referenced with its name and its corresponding publication, and the final hyperparameters, such as the learning rate or the batch size, should be provided. If several evaluations are performed, the number of trials should be given.

For the selection of the hyperparameters of the optimization process or the design of the DL architecture, the method should be explained, even if it is handcrafted. More precisely, the method and the search space of the different hyperparameters should be provided.

A description of the computing infrastructure should be given with technical specifications: at least the name of the manufacturer, the class of the architecture and the memory size should be provided. For the middleware, the kind of implementation should be described (an available toolbox or in-house code, and the build version). If the toolbox is public, a link to the toolbox should be provided. In general, the best solution is to provide a link to the downloadable source code with all dependencies included.

Finally, for the evaluation, a clear description of the results should be given with the average metrics and their variations. If a figure is displayed, such as a boxplot, the values of the error bars should be provided.

All these recommendations have been proposed in the Machine Learning Reproducibility Checklist18. There are two main differences between their recommendations and ours. First, we group the different points by source of variability, whereas they group them by article section (Methods and Results). The second difference is the particular focus on image segmentation proposed here.

Recommendations for the analysis of variability

As shown by the literature review, sources of variability occur in each part of a DL framework: the dataset, the optimization procedure, the selection of the hyperparameters, the DL architecture, and the computational infrastructure. Each kind of variability is different and should be considered with its own tools.

Variability of the dataset

In practice, the available data are always a subset (or sampling) of the true distribution. This sampling effect typically introduces a bias which, in turn, results in variability in the final results.

This fact is important in medical image processing, where the number of samples is limited, and a DL network learning from a particular sampling of the data can overfit. The recommended common tool is cross-validation; it must be noted that its purpose is different from the use of batches in optimization. The choice of the cross-validation method (i.e., the number of samples left out, as in the leave-one-out or k-fold strategies) should be made with regard to the sample size. The choice of the samples to be analyzed (patches or 2D slices extracted from 3D images) can lead to strong correlations between samples.

To obtain a better sampling of the true distribution from the training dataset, data augmentation must be considered. The different transformations for the augmentation must be chosen carefully with regard to the organs studied. Furthermore, data augmentation can enhance the accuracy by compensating for poor diversity of the training dataset with respect to the testing dataset.

Variability of the optimization

For the optimization, the analysis of variability has mostly been studied for classification purposes29,30,31. Our recommendation for managing the variability of the optimization is to perform at least several trials with the same hyperparameters on the same datasets. The observed variability should be reported as the average score and its corresponding standard deviation.

For a deeper analysis that also takes the dataset variability into account, the optimization should be repeated several times within a cross-validation strategy over the dataset. All the results for each fold of the dataset should be grouped, and a one-way ANOVA test should be performed on the different groups to test whether there is a difference or an interaction over the dataset with respect to the optimization. If the assumptions of ANOVA are violated, a strategy based on a nonparametric test over cross-validation has already been proposed74 to better estimate the residual error and to analyze the interaction between the algorithm and the learning dataset.
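A minimal sketch of this recommendation, on synthetic Dice scores, is given below; the Kruskal-Wallis test shown at the end is a generic nonparametric alternative, not the specific procedure of74.

```python
# Several training trials per cross-validation fold, reported as mean +/- std,
# followed by a one-way ANOVA across folds (synthetic Dice scores).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_folds, n_trials = 5, 4
# scores[i, j] = Dice of trial j (re-training with the same hyperparameters) on fold i
scores = rng.normal(loc=0.85, scale=0.02, size=(n_folds, n_trials))

for i, fold_scores in enumerate(scores):
    print(f"fold {i}: Dice = {fold_scores.mean():.3f} +/- {fold_scores.std(ddof=1):.3f}")

# One-way ANOVA: do the mean scores differ significantly between folds?
f_stat, p_value = stats.f_oneway(*scores)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")

# Generic nonparametric alternative if the ANOVA assumptions are violated.
h_stat, p_kw = stats.kruskal(*scores)
```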

Variability of the hyperparameters

Our recommendation for the selection of the hyperparameters is to avoid handcrafted selection. Even if this selection is fast, the set of hyperparameters obtained can have high variability, since the hyperparameters can lie within a wide range of values33. Automatic selection by grid search, random search or Bayesian optimization yields values that are closer to the optimum and potentially more robust. It should be noted that a Network Architecture Search Best Practices Checklist75 was written in September 2019 on this specific subject.

Variability of DL architectures

The main problem addressed by the evaluated articles is: how should different DL architectures be compared? The comparison should consider the variability in the dataset, the optimization and the hyperparameter selection. Our recommendation is to perform, for each evaluated DL architecture, several trials of optimization on each fold of the dataset provided by a cross-validation strategy. A two-way ANOVA can then be used to evaluate the variability of the metrics with respect to the different folds of the cross-validation and the different DL architectures. If the assumptions of the ANOVA are violated, a nonparametric test can be used instead76.
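A minimal sketch of such a two-way analysis with statsmodels, on synthetic Dice scores for two hypothetical architectures, is given below.

```python
# Two-way ANOVA (architecture x cross-validation fold) on synthetic Dice scores,
# with several optimization trials per (architecture, fold) pair.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
rows = []
for arch, base in [("unet", 0.86), ("fcn", 0.84)]:     # two hypothetical architectures
    for fold in range(5):                               # 5 cross-validation folds
        for trial in range(3):                          # 3 training trials per fold
            rows.append({"arch": arch, "fold": fold,
                         "dice": base + rng.normal(0.0, 0.02)})
df = pd.DataFrame(rows)

model = ols("dice ~ C(arch) + C(fold) + C(arch):C(fold)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # effects of architecture, fold and their interaction
```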

Variability of the infrastructure

In general, it is difficult to test the variability due to the infrastructure, since the cost can be high. Our recommendation is to precisely document the specifics of the infrastructure to avoid side effects on reproducibility. Two important factors are the number of processing units and their characteristics (the kind of computing unit used and the available random access memory (RAM)). The number of processing units deeply impacts the framework (distributed or nondistributed system). The RAM can limit the size of the batch during optimization. The kind of computing unit used can introduce quantization effects and problematic numerical noise into the optimization.

Regarding the middleware, an automatic deployment of the operating system and of the toolbox associated with the DL framework is recommended, based on a complete description of the system.

In addition, distributed systems can be considered to complete the simulations in a reasonable time. To mitigate the reproducibility problem, some recommendations concerning the network should be followed, such as the use of InfiniBand (to reduce latency) or of a compartmentalized network (to avoid interactions with other users).

Recommendations for the analysis of the evaluation system

In the context of image segmentation, at least three metrics should be considered19. Because some of them are correlated19, it is important to carefully choose which metric suits the scenario at hand26.

Even if several metrics are defined19 and, to the best of our knowledge, no consensus exists, we propose to evaluate segmentation methods with at least the following three common metrics: the DC, the TPR and the FNR. These metrics are described in the Methods section. Readers interested in image segmentation metrics can find more complex evaluations based on the recommendations of19, 26.

Discussion

The complexity and heterogeneity of DL frameworks are responsible for multiple kinds of variability. Because of the reproducibility crisis21, 22, researchers have highlighted multiple factors that induce variability in the results obtained, as well as important guidelines that must be respected in order to minimize—or at least quantify—these effects: (i) for other researchers to be able to replicate the obtained results, it is necessary to precisely describe the DL framework in use as well as its optimization procedure; (ii) potential sources of variability must be acknowledged and, when possible, evaluated in order to determine their importance. Last, it is crucial to consider the specifics of the field being researched: already-existing data processing methodologies and evaluation procedures must be properly incorporated within the DL framework—see for instance19, 26 for medical image segmentation. In practice, however, assessing reproducibility and variability is a rather difficult task in the context of DL frameworks.

  • Key factors are generally interdependent. For instance, the variability due to the optimization procedure depends not only on the choice of hyperparameters but also on the input data provided, i.e., the datasets. Facing such an issue, there is a need for new mathematical tools: (i) to decorrelate the overall variability and capture the individual effects associated with given parameter subsets; and (ii) to better compare the results obtained with different DL solutions.

  • Heterogeneous nature of the variability. This heterogeneity often makes it difficult to relate different sources of variability. For instance, consider the variability in the input data distribution on the one hand and the variability in the stochastic optimization process on the other: these cannot be addressed in the same way, which in turn means that different mathematical tools are needed to evaluate them.

  • Hardware/software perturbations. Typically, variability is estimated from a large number of repeated simulations, which requires powerful and/or distributed systems. These systems also induce variability, as they may differ slightly (in terms of architecture, data quantization, rounding strategies, implementation constraints, etc.).

Conversely, variability may also be seen as a blessing. For instance, merging different optimization solutions or different DL frameworks improves the segmentation47 and, more generally, the robustness.

Finally, there is no clear consensus on the meaning of reproducibility, robustness and generalizability77. The notion of reproducibility should be driven mainly by the kind of application.