In 2019, a series of tweets went viral in which a tech entrepreneur complained that Apple Card had offered him twenty times the credit limit that it offered to his wife, although they had shared assets. After complaining to Apple representatives, he got the reply: “I don’t know why, but I swear we’re not discriminating, IT’S JUST THE ALGORITHM” [1, 2]. Apple co-founder Steve Wozniak replied that the same thing happened to him and his wife and added [3]: “Hard to get to a human for a correction though. It’s big tech in 2019.” These complaints led to a formal investigation into potentially sexist credit scoring by Apple Card [1, 2]. This example shows how predictive modelling faces major challenges when it cannot explain its decisions, an inability that often stems from the use of complicated models. But why is everyone using these kinds of models? It is often claimed that they perform better than simpler models, but is this always true? How often is it the case and to what extent?

This trade-off between accuracy and comprehensibility is arguably one of the important debates in Artificial Intelligence (AI)Footnote 1 [4, 5]. This trade-off can either limit the performance of AI, if accuracy is lost due to comprehensibility restrictions (for example imposed by regulators) [6, 7], or hurt AI adoption, if user trust is lost due to opaqueness [8]. The Apple Card example shows that companies may use black box models to achieve higher predictive performance, but with the risk of being unable to explain their AI decisions to users or regulators. However, while there has been a lot of research mentioning this trade-off, with most claiming there is one [5, 8,9,10] and others contradicting this [11, 12], there is no systematic study that assesses to what extent a trade-off indeed exists and for what types of datasets.

The goal of this paper is to provide such a systematic study. We focus on tabular datasets as we believe that for these datasets the trade-off would be less clear - and possibly smaller than expected. Deep learning models, which are composed of multiple layers that learn representations of data with multiple levels of abstraction [13] and can thus be considered black box models, perform very well for classification on homogeneous data such as image, audio or text, but they do not necessarily outperform other machine learning techniques on tabular datasets [14,15,16].

Based on the analysis of 90 benchmark datasets across different domains, we study the nature of the differences in accuracy among a number of widely used (a) opaque (“black box”) models, (b) comprehensible (“white box”) models, and (c) surrogate models used to develop a comprehensible surrogate of the opaque ones. We call the difference between (a) and (b) the “Cost of Comprehensibility”, that between (a) and (c) the “Cost of Explainability”, and that between (b) and (c) the “Benefit of Explaining” (Fig. 1).Footnote 2 Our main findings are as follows. First, there is indeed a trade-off but, somewhat surprisingly, it appears to be highly non-linear across datasets. Both costs are relatively small for most datasets, but very large for a few. Second, there are datasets for which the comprehensible models perform as well as or better than the black box models, supporting that one should not forgo trying comprehensible models [17]. We call these datasets “comprehensible datasets”, as opposed to datasets where the black box is strictly better, which we call “opaque datasets”. Understanding what makes a dataset “opaque” vs “comprehensible” and, more so, given the non-linearities observed, what makes the costs very high (positive or negative) is a challenging question as it relates to understanding the data generation processes themselves (e.g., the “nature” of the data and problem at hand). We discuss initial results indicating that some of the main differences between opaque and comprehensible datasets are about their inherent complexity as well as the level of noise in the data. The results indicate that reporting some simple characteristics of a dataset can provide clues, for example to users or regulators, about the potential accuracy and comprehensibility trade-off. To summarize, the contributions of our paper are threefold:

  • A benchmark study comparing state-of-the-art white box and black box algorithms on 90 tabular datasets, and assessing their difference in performance;

  • An analysis of whether surrogate modelling could improve any trade-off between comprehensibility and accuracy;

  • Insights into how dataset properties could predict the nature/size of the trade-off we study.

Fig. 1 Definitions of the Cost of Comprehensibility, the Cost of Explainability and the Benefit of Explaining

Background and setup of the study

What is comprehensibility?

Comprehensibility refers to the ability to represent a machine learning model and explain its outcomes in terms that are understandable to a human [18]. The lack of comprehensibility in black box models is one of their main pitfalls, as their inner workings are hidden from the users, preventing them from verifying whether the reasoning of the system is, for example, aligned with restrictions or preferences of how decisions are made [19,20,21]. Furthermore, it is easier to debug comprehensible models or to detect bias in them, and comprehensibility also increases social acceptance [22]. In general, there are two ways to provide comprehensibility in machine learning [22, 23]: intrinsic comprehensibility is acquired when using models that are comprehensible by nature due to their simple structure, which are the so-called “white box” models [23], while post-hoc comprehensibility aims to explain the predictions without accessing the model’s inner structure [23], as provided by LIME [24], SHAP [25] or counterfactual explanations [26]. Another distinction that can be made is between global comprehensibility and local comprehensibility. Global comprehensibility allows one to understand the whole logic of a model and follow the reasoning that leads to every possible outcome, whereas local comprehensibility makes it possible to understand the reasons for a specific decision [22, 27]. Comprehensibility is very difficult to measure due to its subjective nature. Some compare the comprehensibility of models using user-based surveys [28, 29] while others rely on mathematical heuristics [9], typically the size of the model (e.g., number of rules for a rule learner, number of nodes for a decision tree, or number of variables for a linear model) [30,31,32,33]. Very deep decision trees, for example, can be considered as less comprehensible than a compact neural network [34]. We use the latter, heuristic approach to measure comprehensibility due to its objectivity and scalability.

What are intrinsically comprehensible models?

In line with the literature, we consider small decision trees, rule sets and linear models as comprehensible or “white box” models [8, 22, 27, 35]. We limit the size of these models during training in order for them to be comprehensible. We opted for seven as the size limit for comprehensibility, based on cognitive load theory [36]. According to this theory, the span of absolute judgement and the span of short-term memory pose severe limitations on the amount of information that humans can receive and process correctly, with seven being the typically considered maximum size in both cases [36]. We consider larger decision treesFootnote 3, rule sets and linear models as “black box” ones. We also consider three other machine learning methods in the list of black boxes we test: neural networks, random forests and nonlinear support vector machines. It is generally agreed upon that these algorithms are not comprehensible as their line of reasoning cannot be followed by human users. We base this choice of black box models on the results of benchmark studies in the literature, where these often are among the best performing ones, as can be seen in Table 1.Footnote 4 Comparing all possible models available is of course infeasible, which is a practical limitation of such a study. All the papers mentioned in Table 1 compare different machine learning models but none investigate the difference in performance between the best black box model and the best white box model, nor whether this can be linked to any dataset properties. Many papers claim that black box models will always have a better performance, or on the contrary that simpler models work equally well [11, 12], but a large-scale study about the difference of performance is missing.Footnote 5

Table 1 Models that are used in other benchmark studies

Surrogate modelling

A common practice is to mimic the predictions of a black box with a global white box surrogate model, in order to improve the accuracy while remaining comprehensible [50, 51]. The typical process is to first build a black box model using the available training data, and then build a comprehensible model by training a white box model using the predictions of the black box instead of the original training data. This process is called surrogate modelling [22], oracle coaching [52, 53],  or rule extraction in case the white box model is a decision tree or rule set [6, 54]. A key metric of the quality of the surrogate model is fidelity, which measures how well the predictions of the surrogate model match those of the black box [55]. The most common goal of this kind of modelling is to use the surrogate model to explain the black box model, while still using the black box to make predictions. This requires of course that the surrogate model is (1) more comprehensible than the black box model and (2) sufficiently explains the predictions made (high fidelity).

One can also use the surrogate model instead of the black box to make predictions, in order to improve the performance one could achieve using only comprehensible models. A possible reason why this approach can work, instead of just training a white box model directly using the training data, can be that the black box model may filter out noise or anomalies that are present in the original training data [53, 56]. In this case, a comprehensible model mimicking a black box may be more accurate than a comprehensible model trained on the original data, as shown in some previous work [51,52,53]. Therefore, we also investigate whether surrogate modelling can lead to better performing comprehensible models and, as such, improve the trade-off we study. Specifically, for each dataset we train a white box on the predictions of the best performing black box for that dataset. We call this a surrogate white box model as opposed to a comprehensible model trained on the training dataset which we call a native white box model—see Fig. 1.
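As an illustration of the distinction between native and surrogate white boxes, the following is a minimal sketch, assuming a random forest as the black box and a decision tree with at most seven leaves as the white box. The helper names `fit_native_and_surrogate` and `fidelity` are hypothetical, the data is synthetic, and fidelity is computed here as the plain agreement rate between the two models, which is one common choice rather than necessarily the exact metric used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_native_and_surrogate(X_train, y_train, black_box, max_leaf_nodes=7, seed=0):
    """Fit a black box, a native white box and a surrogate white box (hypothetical helper)."""
    black_box.fit(X_train, y_train)
    # Native white box: trained on the original training labels.
    native = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=seed)
    native.fit(X_train, y_train)
    # Surrogate white box: same model class, trained on the black box's predictions.
    surrogate = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=seed)
    surrogate.fit(X_train, black_box.predict(X_train))
    return black_box, native, surrogate


def fidelity(surrogate, black_box, X):
    """Share of instances on which the surrogate agrees with the black box."""
    return accuracy_score(black_box.predict(X), surrogate.predict(X))


# Synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
bb, native, surrogate = fit_native_and_surrogate(
    X_tr, y_tr, RandomForestClassifier(n_estimators=50, random_state=0)
)
test_fidelity = fidelity(surrogate, bb, X_te)
```

Comparing the test performance of `native` and `surrogate` on the original labels then shows whether relabeling through the black box helped on a given dataset.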

Dataset properties

Finally, we study whether there are simple (standard) properties of a dataset that may determine whether it is opaque (the best black box model outperforms the best white box) or comprehensible (the reverse happens). We use the standard toolbox of Alcobaça et al. [57], which automatically extracts numerous characteristics (“meta-features”) for any given dataset. We consider four types of dataset characteristics from this toolbox: general ones, which capture basic information such as the number of instances or the number of attributes [58]; statistical ones, which capture information about the data distribution such as the number of outliers, variance, skewness, etc. [58]; information-theoretic ones, which capture characteristics such as the joint entropy, class entropy, class concentration, etc. [58]; and so-called complexity related ones, which, for example in the case of a classification problem, estimate the difficulty in separating the data into their classes [59].Footnote 6 We opt for using a standard toolbox and set of dataset characteristics to make this analysis general, easily reproducible and simple to use in practice.
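To give a feel for what such meta-features look like, here is a minimal sketch that computes a handful of them by hand with NumPy and SciPy. It does not use the toolbox itself; the function name `simple_meta_features` and the choice of features are assumptions made for illustration, and the toolbox's own definitions may differ in detail.

```python
import numpy as np
from scipy.stats import skew


def simple_meta_features(X, y):
    """Compute a few illustrative meta-features (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return {
        # General meta-features.
        "n_instances": X.shape[0],
        "n_attributes": X.shape[1],
        # Statistical meta-feature: skewness averaged over the attributes.
        "mean_skewness": float(np.mean(skew(X, axis=0))),
        # Information-theoretic meta-feature: entropy of the class distribution (bits).
        "class_entropy": float(-np.sum(p * np.log2(p))),
    }


features = simple_meta_features([[0, 1], [1, 0], [2, 1], [3, 0]], [0, 0, 1, 1])
```

For a perfectly balanced binary problem the class entropy is one bit, and it decreases as the classes become more imbalanced.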

Materials and methods


We use a large benchmark study to compare the algorithms on different tabular datasets. Benchmark comparisons are usually developed over a few, typically standard data sets, as a machine learning method might perform well on some of the datasets but not generalize to a broader range of problems [43].

To perform our experiments, we use all the binary classification datasets from the Penn Machine Learning Benchmark (PMLB) suite [43]. This is a dataset suite that is publicly available on GitHub,Footnote 7 which consists of both real-world and simulated benchmark datasets to evaluate supervised classification methods. It is compiled from a wide range of existing ML benchmark suites such as KEEL, Kaggle, the UCI ML repository and the meta-learning benchmark. At this moment, PMLB consists of 162 classification datasets and 122 regression datasets. We focus on the binary classification datasets, which amount to 90 datasets in total.

Some preprocessing was already done by the compilers of this benchmark suite. All the datasets were preprocessed to follow a standard row-column format and all categorical features and features with non-numerical encodings were replaced with numerical equivalents. All datasets with missing data were excluded, to avoid the impact of imposing a specific data imputation method. The datasets used are shown in Table 3.


Our methodology is shown in Fig. 2. For each dataset we create a training and test set, using 75% of the data for training and 25% for testing. Both the training and the test set are scaled according to the parameters of the training set with Sklearn’s MinMaxScaler.Footnote 8 This estimator scales each feature individually so that it is between zero and one on the training set. We also use a stratified split to make sure that enough labels of each class are present for the training phase. GridSearchCV from SklearnFootnote 9 is used with its default 5-fold cross validation to tune the hyperparameters of every model. The training data is divided into five folds, where each time another fold is taken as the validation set. GridSearchCV then performs an exhaustive search over a specified hyperparameter grid, which is reported for each modelling technique in the sections below (starting with “Black box models”), and then checks on the validation set which parameter settings performed best. By doing this five times, instead of just using one validation set, we get a more accurate representation of how the model behaves on unseen data, and we do not rely on a single choice of validation set. We select the best hyperparameter values for each modelling technique based on this tuning. Moreover, for each dataset we also select the best surrogate model. We do this by creating a new training set, which is a copy of the original training set but with the predictions of the best black box model (based on the cross-validation performance) as labels. The surrogate model is trained on this relabeled training set and can be any of the original white box models, as well as Trepan or RuleFit. The final performance of all the models (black box, white box and surrogate) is evaluated on the test set based on two metrics: accuracy and f1-score. The difference in the test set performance among the different models is shown in Fig. 3.
For each dataset we select the best black box, the best white box and the best surrogate, based on their performance on the test set.Footnote 10 In our aggregate analyses, we compare the test performances of these across all datasets.
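The split, scaling and tuning steps described above can be sketched as follows. This is a minimal illustration on synthetic data rather than the exact experimental code: the function name `tune_and_evaluate`, the random seed, the intermediate grid values and the use of GridSearchCV's default scoring are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler


def tune_and_evaluate(X, y, estimator, param_grid, seed=0):
    """Split, scale, tune with 5-fold CV and score on the test set (hypothetical helper)."""
    # Stratified 75/25 split into training and test data.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    # The scaling parameters are learned on the training set only.
    scaler = MinMaxScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
    # Exhaustive grid search with 5-fold cross-validation on the training data.
    search = GridSearchCV(estimator, param_grid, cv=5)
    search.fit(X_tr, y_tr)
    y_pred = search.predict(X_te)
    return {
        "best_params": search.best_params_,
        "cv_score": search.best_score_,
        "test_accuracy": accuracy_score(y_te, y_pred),
        "test_f1": f1_score(y_te, y_pred),
    }


# Synthetic data; logistic regression uses its default l2 penalty.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
C_GRID = [0.0001, 0.01, 1, 100, 1000]
result = tune_and_evaluate(
    X, y, LogisticRegression(solver="liblinear"), {"C": C_GRID}
)
```

Because the scaler is fitted on the training split only, test features can fall slightly outside the zero-to-one range, which is expected.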

Fig. 2


Fig. 3 Critical difference diagram of the comparison of classifiers. Models that are not connected with a bold line have a significant difference in performance (at a 5% level with the Nemenyi test)

Black box models

We use three state-of-the-art black box models: neural networks, random forests and nonlinear support vector machines [39, 60]. As noted below, we also include in the list of black boxes the three comprehensible models when their size - after training - is very large.

Random forest We use the RandomForestClassifierFootnote 11 from Sklearn and use a grid search to tune the number of trees in the forest with values between 10 and 2000 and the number of features to consider when looking for the best split, with options 'sqrt' and None (all features).

Support vector machine We use the SVCFootnote 12 from Sklearn and use a grid search to tune the regularization hyperparameter with values between 0.1 and 1000 and the kernel coefficient with values between 0.0001 and 1. We use the default kernel type of rbf.

Neural network We use the MLPClassifierFootnote 13 from Sklearn and use a grid search to tune the size of the hidden layer. We only test neural networks with one hidden layer. We tune the hidden layer with sizes between 10 and 1000.
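The three search spaces above could be written down as follows. Only the endpoints of each range and the fixed choices (rbf kernel, a single hidden layer) come from the text; the intermediate grid points and the names `BLACK_BOX_GRIDS` and `candidate_models` are assumptions made for this sketch.

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Estimator and hyperparameter grid per black box (intermediate values are assumed).
BLACK_BOX_GRIDS = {
    "RF": (
        RandomForestClassifier(),
        {"n_estimators": [10, 100, 500, 2000], "max_features": ["sqrt", None]},
    ),
    "SVM": (
        SVC(kernel="rbf"),
        {"C": [0.1, 1, 10, 100, 1000], "gamma": [0.0001, 0.001, 0.01, 0.1, 1]},
    ),
    "MLP": (
        MLPClassifier(),
        # One-element tuples: a single hidden layer of the given size.
        {"hidden_layer_sizes": [(10,), (100,), (1000,)]},
    ),
}


def candidate_models(name):
    """Return one unfitted estimator per hyperparameter combination."""
    estimator, grid = BLACK_BOX_GRIDS[name]
    return [clone(estimator).set_params(**params) for params in ParameterGrid(grid)]
```

Each `(estimator, grid)` pair can be passed directly to `GridSearchCV`; `candidate_models` is only a convenience for inspecting how many configurations a grid contains.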

Comprehensible models

We use three models that are in general considered to be comprehensible, when their size is constrained. As discussed in the main article, we limit the size of these models to 7 (maximum number of nodes for trees, rules for rule-based systems, coefficients for logistic regression). We also train these models without constraining their size. In this case, when their size after training is very large, with more than 50 elements, we consider them as part of the black boxes in our analysis.

Decision tree We use the DecisionTreeClassifierFootnote 14 from Sklearn. We use a grid search to tune the function to measure the quality of the split (gini, entropy), the maximal depth between 2 and 30 and the minimum number of samples in a leaf (2, 4). We tune the maximum number of leaf nodes between 2 and 7 for the constrained cases (white boxes) and between 2 and 1000 for the unconstrained ones (black boxes).
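A size-constrained tree search of this kind might look as follows. This is a sketch on synthetic data: the helper name `tree_grid`, the intermediate depth values and the unconstrained leaf-count values are assumptions, while the split criteria, the leaf-sample values and the range endpoints follow the description above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def tree_grid(constrained=True):
    """Hyperparameter grid for a white box (constrained) or black box tree."""
    return {
        "criterion": ["gini", "entropy"],
        "max_depth": [2, 5, 10, 30],
        "min_samples_leaf": [2, 4],
        # White boxes may have at most 7 leaves; black box trees up to 1000.
        "max_leaf_nodes": list(range(2, 8)) if constrained else [2, 10, 100, 1000],
    }


# Synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
search = GridSearchCV(DecisionTreeClassifier(random_state=0), tree_grid(True), cv=5)
search.fit(X, y)
white_box_tree = search.best_estimator_
n_leaves = white_box_tree.get_n_leaves()
```

Checking `get_n_leaves()` after training is a simple way to confirm that the fitted tree respects the size limit.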

Logistic Regression We use the LogisticRegressionFootnote 15 from Sklearn. We use l2 regularization and the liblinear solver. We use a grid search to tune the regularization parameter values between 0.0001 and 1000.

Ripper We use a rule learning algorithm based on sequential covering. This method repeatedly learns a single rule to create a rule list that covers the entire dataset rule by rule [22]. RIPPER (Repeated Incremental Pruning to Produce Error Reduction), which was introduced by Cohen in 1995, is a variant of this algorithm [61]. We use the Python implementation of Ripper hosted on GitHub.Footnote 16

Surrogate models

We use the three comprehensible models above but this time we train them on the predictions of the best performing black box instead of using the training data. We also include Trepan [54], which is used for rule extraction based surrogate modeling, and RuleFit [62], which is based on an underlying Random Forest model. Again, we limit the size of the comprehensible models to 7.

Trepan We use the Python package Skater to implement TreeSurrogates,Footnote 17 which is based on [54]. The base estimator (oracle) can be any supervised learning model. The white box model has the form of a decision tree and can be trained on the decision boundaries learned by the oracle. We use the same hyperparameter settings to tune the decision trees from Trepan as for the DecisionTreeClassifier.

RuleFit The RuleFit algorithm learns sparse linear models that include automatically detected interaction effects in the form of decision rules [62]. The interpretation is the same as for normal linear models but now some of the features are derived from decision rules. We use the Python implementation of RuleFit hosted on GitHub.Footnote 18


First, we address the cost of comprehensibility, by testing whether native white and black box models have a significant difference in performance. To assess this cost, we use both the models’ f1-score and accuracy.Footnote 19 The figures for the latter are reported in Fig. 6. We first compare all the classifiers using the Friedman testFootnote 20 [63] to identify whether there are any significant differences between the different models, and then the post-hoc Nemenyi test [64] to identify significant pairwise differences.Footnote 21 The null hypothesis of the Friedman test is rejected with a p-value of \(2.43\cdot 10^{-25}\) (a value with the same order of magnitude when using accuracy instead of f1-scores). This means that there are significant differences among some groups of algorithms. We use the post-hoc Nemenyi test to perform all possible pairwise comparisons [65]. The results are shown in the critical difference diagramFootnote 22 in Fig. 3. The performance of the black box models (RF, MLP, SVM) is significantly better than the performance of the white box models (DT, LR, Ripper), already confirming that, overall, the cost of comprehensibility indeed exists.
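The two-step procedure (Friedman test, then Nemenyi critical difference) can be sketched as below. The scores are synthetic, the function name `friedman_nemenyi` is hypothetical, and the critical values are those commonly tabulated for the Nemenyi test at the 5% level, which we assume here rather than take from this study.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Assumed critical values q_alpha (alpha = 0.05), indexed by the number of models.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}


def friedman_nemenyi(scores):
    """scores: array of shape (n_datasets, n_models), higher is better."""
    scores = np.asarray(scores, dtype=float)
    n_datasets, n_models = scores.shape
    # Friedman test: do the models differ at all across datasets?
    _, p_value = friedmanchisquare(*scores.T)
    # Rank 1 is the best model on a dataset; ties share the average rank.
    avg_ranks = rankdata(-scores, axis=1).mean(axis=0)
    # Two models differ significantly if their average ranks differ by more than cd.
    cd = Q_ALPHA_005[n_models] * np.sqrt(n_models * (n_models + 1) / (6.0 * n_datasets))
    return p_value, avg_ranks, cd


# Synthetic f1-scores for three models on 30 datasets.
rng = np.random.default_rng(0)
base = rng.uniform(0.6, 0.8, size=(30, 1))
scores = base + np.array([0.10, 0.05, 0.0])
p_value, avg_ranks, cd = friedman_nemenyi(scores)
```

In a critical difference diagram, models whose average ranks lie within `cd` of each other are the ones joined by a bold line.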

The cost of comprehensibility

Having established that the cost of comprehensibility exists, we study how large it is across datasets. As discussed, for each dataset we select the best black and white boxes and measure their relative difference in performance - namely, the cost of comprehensibility. Figure 4a shows the results across all datasets when we order them according to this cost. This figure reveals a somewhat surprising result: this cost is highly non-linear (i.e., the plot is a sigmoid instead of being closer to a straight line). For most datasets the accuracy-comprehensibility trade-off is low; only for a few is it very high (right), and for a few it is very “negative”, indicating that comprehensible models largely outperform the black box ones for these datasets (left). Yet, for 68.89% of the datasets the best black box model outperforms the best white box model, reconfirming the overall existence of the cost of comprehensibility. The results for accuracy can be seen in Fig. 7a.
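For concreteness, the three quantities we study can be computed from the f1-scores of the best black box, the best native white box and the best surrogate white box as sketched below. The function names are hypothetical, and two conventions are assumed: the gap is normalised by the better of the two scores being compared, and the Benefit of Explaining is signed so that it is positive when the surrogate beats the native white box.

```python
def relative_gap(score_a, score_b):
    """Difference (a - b) divided by the better of the two scores."""
    best = max(score_a, score_b)
    return 0.0 if best == 0 else (score_a - score_b) / best


def tradeoff_measures(f1_black_box, f1_native_wb, f1_surrogate_wb):
    """Return the three trade-off quantities for a single dataset."""
    return {
        "cost_of_comprehensibility": relative_gap(f1_black_box, f1_native_wb),
        "cost_of_explainability": relative_gap(f1_black_box, f1_surrogate_wb),
        "benefit_of_explaining": relative_gap(f1_surrogate_wb, f1_native_wb),
    }


# Illustrative (made-up) scores for one dataset.
measures = tradeoff_measures(0.90, 0.80, 0.85)
```

Sorting the datasets by `cost_of_comprehensibility` gives the kind of ordered curve discussed above, with negative values on the left for datasets where the white box wins.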

Fig. 4 Comparing black box and white box models. For both plots, the datasets are ordered according to the gap in f1-score between the best black box and the best native (left figure) or surrogate (right figure) white box model. The y-axis measures the relative difference in the f1-score, defined as the ratio of the difference between the black and white box f1-scores divided by that of the best model

Can surrogate modeling improve the accuracy-comprehensibility trade-off?

We next investigate whether surrogate modelling can improve the performance of the (native) comprehensible models. For all datasets we generate the best black box and the best (native) white box trained on the training data, and then we also train a surrogate model mimicking the best black box one - what we previously called a surrogate white box. We compare the performance of these three types of models across all datasets in Fig. 5. As indicated in Fig. 5a, surrogate modelling does improve accuracy slightly relative to native white box models, on average across all datasets. We term this improvement the “Benefit of Explaining”, a benefit in terms of improved predictive accuracy. Based on the Wilcoxon Signed Rank testFootnote 23 [63], used to compare classifiers across several datasets, we can reject the hypothesis that the native and surrogate white boxes perform equally well (p-value 0.003) – the latter performing on average better. The result for accuracy can be seen in Fig. 8.
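A paired comparison of this kind can be run with SciPy's Wilcoxon signed-rank test, as in the following sketch. The scores are synthetic and only mimic the setting (90 datasets, a small average gain for the surrogate); the helper name `compare_paired` is an assumption.

```python
import numpy as np
from scipy.stats import wilcoxon


def compare_paired(scores_a, scores_b):
    """Two-sided Wilcoxon signed-rank test on per-dataset score pairs."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    _, p_value = wilcoxon(a, b)
    return p_value, float(np.mean(a - b))


# Synthetic per-dataset f1-scores for native and surrogate white boxes.
rng = np.random.default_rng(0)
native_scores = rng.uniform(0.6, 0.9, size=90)
surrogate_scores = native_scores + rng.normal(0.02, 0.01, size=90)
p_value, mean_gain = compare_paired(surrogate_scores, native_scores)
```

The test only says whether the paired scores differ; the sign of `mean_gain` indicates which of the two models is better on average.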

Fig. 5 Comparison across datasets of best black box model for each dataset, surrogate white box model mimicking this best black box, and best native white box model. BB stands for black box and WB for white box. The line at 0 indicates the performance of the best black box model. The y-axis indicates the absolute difference in f1-score from the best black box model

We perform the same analysis, but this time for two different types of datasets: those for which the best performing model is a black box, what we termed opaque datasets, and those for which white boxes perform at least as well as black boxes, what we called comprehensible datasets. The results are shown in Fig. 5b, c. Interestingly, in this case the surrogate white box models outperform the native white box models on average across the opaque datasets (Wilcoxon test p-value of \(7.72\cdot 10^{-5}\)), while the two are not significantly different for the comprehensible datasets (Wilcoxon test p-value of 0.20). In the latter case there is no need to go through a black box if its performance is not better than that of a native white box [56, 67], as the latter would dominate both in terms of accuracy and comprehensibility. Hence, if one considers only opaque datasets, the use of surrogate modeling can indeed improve the accuracy-comprehensibility trade-off on average.

The cost of explainability

Next, we investigate the difference in performance between the best black box model for each dataset and the best surrogate white box model from that black box - what we call the cost of explainability. Fig. 4b shows the results when we sort all datasets based on this cost. The results are similar to what we observe for the cost of comprehensibility: the difference is small for most datasets, but very large for a few. The results are also in agreement with those in Fig. 7, where we see that the cost of explainability is a bit lower than the cost of comprehensibility.

Opaque vs. comprehensible datasets

Finally, we study whether the cost of comprehensibility relates to some properties of the dataset. To do so, for each dataset we generate a number of standard dataset properties as discussed above (see also Supplementary Information material), and use them to explain the cost of comprehensibility. Specifically, we run a regression analysis using the generated dataset properties as independent variables with the dependent variable being the difference between the performance of the best black box model and the best native white box model. We used all 90 datasets, hence the number of observations used for the regression was also 90. The variables that are significant are shown in Table 2. Overall, these results indicate that properties related to the complexity required to model a dataset and the level of noise in a dataset significantly explain the cost. While this is a relatively simple analysis, the results suggest that one may be able to identify or communicate whether there is a potential cost of comprehensibility by simply reporting specific dataset properties.
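As a rough illustration of such an analysis, the sketch below fits an ordinary least squares regression and reports a two-sided p-value per coefficient. The regression type is an assumption (the text does not specify it), the helper name `ols_with_p_values` is hypothetical, and the data is synthetic: 90 observations, one property that drives the cost and one that does not.

```python
import numpy as np
from scipy import stats


def ols_with_p_values(features, target):
    """OLS with an intercept; returns coefficients and two-sided p-values."""
    y = np.asarray(target, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(features, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    # Coefficient covariance under the usual homoskedastic-error assumption.
    sigma2 = residuals @ residuals / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t_stats = beta / np.sqrt(np.diag(cov))
    p_values = 2 * stats.t.sf(np.abs(t_stats), dof)
    return beta, p_values


# Synthetic dataset properties and cost of comprehensibility for 90 datasets.
rng = np.random.default_rng(0)
properties = rng.normal(size=(90, 2))
cost = 0.05 * properties[:, 0] + rng.normal(scale=0.01, size=90)
beta, p_values = ols_with_p_values(properties, cost)
```

Index 0 of `beta` and `p_values` refers to the intercept; the remaining entries follow the column order of the property matrix.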

Table 2 The dataset properties that are significant when explaining the cost of comprehensibility using a number of standard dataset properties as independent variables in a regression model where the cost is the dependent variable

Specifically, the following five properties are found to be significant. The first is F1v, the directional-vector Maximum Fisher’s discriminant ratio, which indicates whether a linear hyperplane can separate most of the data; lower values mean that more data can be separated this way [59]. The second is L1, a linearity measure that quantifies whether the classes can be linearly separated [58]. Higher values of this attribute indicate more complex problems as they require a non-linear classifier [59]. Both properties have a positive coefficient in the regression analysis, which means that they increase the gap between the best black box model and the best white box model. The sign of these coefficients is as expected, namely that for datasets that are more complex to separate linearly, the performance of black box models compared to simple models is on average better.

Two other features, EqNumAttr and NsRatio, capture information related to the minimum number of attributes necessary to represent the target attribute and the proportion of data that is irrelevant to the problem (level of noise) [58, 68]. We see that these dataset properties have a negative relationship with the size of the cost. Note that when we analyze this result at the level of each individual prediction model, we see that these properties negatively affect both the performance of the black box models and the white box models, but more so for the black box ones. This could be because black box models may pick up more of the noise or use a lot of irrelevant features. Finally, N3 [59] is a neighbor-based measure that refers to the error rate of the nearest neighbor classifier. Low values of this dataset property indicate that there is a large gap in the class boundary [69]. We see again that this property negatively affects both the performance of the black box models and the white box models [69], and that the effect on the gap depends on how much it affects the performance of each model.
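To make the N3 measure concrete, the following sketch estimates the error rate of a one-nearest-neighbour classifier with leave-one-out evaluation. Leave-one-out estimation and Euclidean distance are assumptions of this sketch, and the function name `n3_measure` is hypothetical; the toolbox's implementation may differ, for example in how distances are computed.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier


def n3_measure(X, y):
    """Leave-one-out error rate of a 1-nearest-neighbour classifier."""
    y = np.asarray(y)
    predictions = cross_val_predict(
        KNeighborsClassifier(n_neighbors=1), X, y, cv=LeaveOneOut()
    )
    return float(np.mean(predictions != y))


# Two well-separated clusters: every point's nearest neighbour shares its class.
separated = n3_measure([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]], [0, 0, 0, 1, 1, 1])
# Alternating labels: every point's nearest neighbour has the other class.
interleaved = n3_measure([[0.0], [1.0], [2.0], [3.0]], [0, 1, 0, 1])
```

The two toy datasets sit at the extremes of the measure: a clean gap between the classes gives the lowest possible value, while fully interleaved classes give the highest.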


Understanding the trade-off between comprehensibility and accuracy can have important implications for regulators as well as companies [70]. Our results indicate that most of the time the trade-off is relatively small, indicating that one should consider native white box algorithms as a key benchmark. Indeed, given the non-linearities we observe, one would expect that black boxes are used relatively infrequently, even if for the majority of cases they outperform white boxes, as our study indicates that this outperformance is typically relatively small. Some papers in the literature also indicate that for certain datasets simple models work as well as complex ones [11, 12] or that for most datasets the outperformance by black box models will be very small [71], despite the popular belief that more complex models are always better. Of course, it depends on the use case and application domain whether this small difference in performance is worth the loss in comprehensibility. Due to social and ethical pressure, insight into when one should opt for a comprehensible model could be a competitive differentiator and drive real business value [70]. Insights into this trade-off could lead to specific guidelines from regulators on how and when to apply AI algorithms when comprehensibility is required.

Our results also show that using surrogate modelling could reduce the cost of comprehensibility, especially for opaque datasets. As we discussed, this may be the case because the black box model in between can filter out noise and anomalies [53, 56]. We also see that simple properties of a dataset could provide insights (for example to a third party such as a user or regulator) into the nature of the trade-off without requiring knowledge of the algorithms tested or the data used. For example, attributes that measure how difficult it is to linearly separate the data are significantly correlated with the size of the gap. Indeed, one would expect that for these datasets black box models might be better at capturing the non-linearities. This can lead to practical tests of the feasibility of using a native white box – and the potential accuracy loss – in a given use case.

Our general findings suggest the following guidelines:

  1. Start with white box models.

  2. Train additional black box models if: (a) the application allows for a (possibly small) increase in performance at a cost of comprehensibility, and, (b) the level of noise is high and the data requires complex modeling, as indicated by the listed, easy to calculate dataset metrics.

  3. If there is a practically important cost of comprehensibility (hence you are dealing with an opaque dataset), apply additional surrogate modeling algorithms.

Finally, we note that in this study we focused on tabular datasets. For other kinds of datasets, the trade-off we study may be different. For example, for image or text data, more flexible models are needed to handle the data complexity [9, 13] and the difference in performance between comprehensible models compared to black box ones such as deep learning is often considered unbridgeable [8].