In their study, Mayr et al. concluded that “deep learning methods significantly outperform all competing methods.” Much of this conclusion rests on the small p-values from a Wilcoxon signed-rank test used to quantify the differences in the average performance of the classifiers. For example, they report a p-value of \(1.985 \times 10^{-7}\) for the alternative hypothesis that feedforward deep neural networks (FNN) have a higher area under the receiver operating characteristic curve (AUC–ROC) than support vector machines (SVM). For the alternative hypothesis that FNN outperform Random Forests (RF), the p-value is even more extreme (\(8.491 \times 10^{-88}\)). From such low p-values, one might be led to believe that FNN is the only algorithm worth trying in drug discovery. Yet a closer look at the data reveals that this conclusion is erroneous and obscures much of the variability from assay to assay.
Are all assays created equal?
To demonstrate the problems, we begin with an initial example of SVM and FNN performance using ECFP6 fingerprints. Table 1 shows AUC–ROC results from the FNN and SVM classifiers for two assays in the Mayr et al. dataset. Assay A is a functional assay with a small number of samples. Each fold is heavily imbalanced and consists mostly of active compounds; importantly, this is often the opposite of the imbalance one would observe in a real screen. As is expected with a small amount of highly imbalanced data, both the FNN and SVM classifiers show highly variable results with very large confidence intervals. In fold 2, where only a single active compound is present in the test set, it is not even clear how to calculate a confidence interval for AUC–ROC. The mean and standard error of the mean (SEM) are also calculated across the three folds, though this is somewhat dangerous since it discards our knowledge of the uncertainty within each fold.
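For readers who wish to reproduce this kind of fold-level uncertainty estimate, the sketch below shows one common way to obtain a confidence interval for a single fold's AUC–ROC, namely a percentile bootstrap over the test compounds. The labels and scores are hypothetical placeholders, and this is not necessarily the interval construction used to produce Table 1; it simply illustrates why a fold with very few compounds of one class gives an unreliable interval.

```python
# Minimal sketch: percentile-bootstrap confidence interval for one fold's
# AUC-ROC. The labels and scores below are hypothetical, not Mayr et al. data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical, heavily imbalanced fold: mostly actives, few inactives.
y_true = np.array([1] * 45 + [0] * 5)
y_score = rng.uniform(size=y_true.size)  # stand-in classifier scores

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for AUC-ROC; returns (lower, upper)."""
    idx = np.arange(y_true.size)
    aucs = []
    for _ in range(n_boot):
        sample = rng.choice(idx, size=idx.size, replace=True)
        # A resample containing only one class has no defined AUC; skip it.
        # With very few compounds of one class, many resamples fail this
        # check, which is why the interval becomes unreliable.
        if np.unique(y_true[sample]).size < 2:
            continue
        aucs.append(roc_auc_score(y_true[sample], y_score[sample]))
    return tuple(np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

print(bootstrap_auc_ci(y_true, y_score))
```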
Table 1 Two separate assays from the Mayr et al. data with the accompanying FNN and SVM prediction results

In contrast, assay B is a functional assay with large sample sizes and class imbalances that more closely resemble those typically seen in the literature. Performance is quite good, with the SVM classifier outperforming the FNN classifier on each fold. Furthermore, the confidence intervals for each AUC–ROC value are quite small. Again, the mean and SEM are calculated across the folds for each classifier. Additionally, Fig. 1 gives a visual representation of the performances for assay A and assay B.
One would likely agree that Fig. 1 shows a striking difference between the results of the two assays. While the results of assay A for FNN and SVM are extremely noisy and raise many questions, assay B shows a well-defined difference in the performance of the two algorithms, even relative to the noise levels of the measurements. Though not a formal analysis, given the presence of noise, one would likely consider the difference in mean performances on assay A, \(0.67 - 0.57 = 0.10\), to be much less meaningful than the difference in mean performances on assay B, \(0.929 - 0.900 = 0.029\). To most practitioners, the comparative performances on assay B give much stronger evidence that SVM outperforms FNN than the comparative performances on assay A, even though the difference is smaller in magnitude. More formally, one can compare the effect size, which measures the difference in the performance of the two models relative to the standard deviation of the data and thus accounts for the variability of the respective datasets. The effect size can also be related to the probability of one model outperforming the other [24]. Note that the p-value depends on sample size whereas the effect size does not: with a sufficiently large sample, relying on p-values will almost always suggest a significant difference unless there is no effect whatsoever (an effect size of exactly zero). However, very small differences, even if significant, are practically meaningless. Using the canonical definition for effect size from Cohen (Cohen’s d):
$$\begin{aligned} d=\frac{\mu _{2}-\mu _{1}}{\sqrt{\frac{s_{1}^{2}+s_{2}^{2}}{2}}}, \end{aligned}$$
where \(\mu _1\) and \(\mu _2\) are the mean performances and \(s_1\) and \(s_2\) are the standard deviations of the data (not the standard errors of the means). Using this formula, the effect size in assay B (4.40) is approximately eight times the effect size in assay A (0.55). Nevertheless, a problem arises because the Wilcoxon signed-rank test used by Mayr et al. treats the noisy, less informative assay A as stronger evidence than assay B for the superiority of SVMs over FNNs.
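As a concrete illustration, the short sketch below computes Cohen's d as defined above, together with the related probability of superiority \(\Phi (d/\sqrt{2})\) mentioned above [24]. The standard deviations used here are illustrative placeholders, not the original per-fold values.

```python
# Minimal sketch of Cohen's d and the related probability of superiority.
# The standard deviations below are illustrative, not the original data.
import math
from scipy.stats import norm

def cohens_d(mu1, mu2, s1, s2):
    """Difference in means divided by the pooled standard deviation."""
    return (mu2 - mu1) / math.sqrt((s1 ** 2 + s2 ** 2) / 2.0)

def prob_superiority(d):
    """Probability that a random draw from model 2 exceeds one from model 1."""
    return norm.cdf(d / math.sqrt(2))

# Assay B style numbers: small absolute difference, small spread -> large d.
d_b = cohens_d(0.900, 0.929, 0.007, 0.006)  # means from the text, s1/s2 illustrative
print(round(d_b, 2), round(prob_superiority(d_b), 3))
```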
The Wilcoxon signed-rank test is a non-parametric paired-difference test often used to determine whether matched samples came from the same distribution. The test is perhaps best explained by example: imagine two methods, \(M_{\text {shallow}}\) and \(M_{\text {deep}}\), are used on a variety of prediction tasks. These prediction tasks vary widely, and the algorithms have vastly different expected performances on each task. For our example, consider the sample tasks of predicting a coin flip (COIN), predicting the species of a flower (FLOWER), and predicting the label of an image (IMAGE), among many other varying tasks. The tests used to measure the accuracy (ACC) on these differing tasks are also quite different: for the coin prediction task, only ten coins are tossed; for the flower task, around 100 flowers are used; and for the image task, 10,000 images are used. Thus, our made-up results when comparing the models might look like those in Table 2:
Table 2 The results from our imagined example

As our imagined table shows, the difference in performance between the methods is most drastic for the COIN dataset, which is also the noisiest, as shown by the large confidence intervals. However, this result is also the most meaningless of the three shown, since we know that all methods will eventually converge to an accuracy around 0.5 as the number of test samples grows. The IMAGE dataset is likely the best indicator of the superior method (at least on image problems), but the “rank” of its difference (after ordering all the absolute differences) is quite small compared to that of COIN, and potentially to those of many other unreliable tests of performance. Unfortunately, since the Wilcoxon signed-rank test relies only on the signed rank (the rank of the difference multiplied by the sign of the difference), all information regarding the variability in a given test is discarded.
The null hypothesis of the Wilcoxon test is that the differences in the methods are distributed symmetrically and centered on zero. The test statistic W is simply the sum of the signed ranks and has an expected value of zero, with a known variance. As a result, the larger magnitude differences between the two methods will be considered more important by the test, due to their high ranks. Unfortunately, in our illustrative example, the highly ranked differences are not those that give the best evidence of differences between the methods.
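The following short sketch makes this concrete: the test receives only the paired differences, ranks their absolute values, and re-applies the signs, so the noisy COIN task ends up carrying the most weight. The accuracies are placeholder numbers in the spirit of the example above, not the values in Table 2.

```python
# Sketch of the signed ranks for the made-up example; accuracies are
# placeholders consistent with the narrative, not the Table 2 values.
import numpy as np
from scipy.stats import rankdata, wilcoxon

tasks = ["COIN", "FLOWER", "IMAGE"]
acc_shallow = np.array([0.40, 0.70, 0.85])
acc_deep = np.array([0.60, 0.72, 0.86])

diffs = acc_deep - acc_shallow
ranks = rankdata(np.abs(diffs))        # COIN's large, noisy difference gets the top rank
signed_ranks = np.sign(diffs) * ranks
print(dict(zip(tasks, signed_ranks)))  # W is the sum of these signed ranks

# The call below receives nothing about the per-task confidence intervals;
# with only three tasks the p-value itself is not meaningful here.
print(wilcoxon(acc_deep, acc_shallow, alternative="greater"))
```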
Coming back to the example from Mayr and coworkers, the test treats the difference in performances on each assay as commensurate, and assumes that the larger magnitude difference of mean AUC–ROC values in assay A should carry more weight than the smaller magnitude difference of mean AUC–ROC values in assay B. This, again, is not necessarily true.
Instead, the effect size, which measures the magnitude of the difference relative to the uncertainty, is more important than the pure magnitude of the difference. As another complication, differences in AUC/probability space are not straightforward: \(p=0.01\) and \(p=1\times 10^{-6}\) have a smaller difference in absolute magnitude than \(p=0.51\) and \(p=0.52\); however, the difference between one in 100 and one in a million is likely much more important than the difference between 51 and 52%. Lastly, even setting these concerns aside, assuming the results are commensurate was already problematic given the heterogeneity of the assay types, the varying sample sizes, the varying imbalances, and the diverse target classes.
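To make the point about probability scales concrete, one can compare the same pairs on the log-odds scale, where a shift from one in a hundred to one in a million is enormous while a shift from 0.51 to 0.52 is barely visible (a small illustrative calculation, not part of the original analysis):

```python
# Absolute differences versus log-odds differences for the probabilities
# discussed in the text; purely illustrative arithmetic.
import math

def logit(p):
    return math.log(p / (1 - p))

print(abs(0.01 - 1e-6), abs(logit(0.01) - logit(1e-6)))  # ~0.01 absolute, ~9.2 in log-odds
print(abs(0.52 - 0.51), abs(logit(0.52) - logit(0.51)))  # 0.01 absolute, ~0.04 in log-odds
```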
A different test, a different question
Having realized that the Wilcoxon signed-rank test is inappropriate, we turn to the sign test as perhaps the most suitable procedure. The sign test essentially counts the proportion of “wins” for one algorithm over another across all of the datasets. That is, we simply consider the sign of the difference in performance between the methods. The test allows us to probe the question: “on a given dataset/assay, which of the methods will perform best?” This question addresses the concern of a practitioner implementing a bioactivity model and choosing among many potential predictive models. In contrast, the statistic of the Wilcoxon signed-rank test is much less interpretable, providing less clarity to the user.
As with many of these tests, the null hypothesis of the sign test is that the two models show the same AUC–ROC performance on the datasets. Under this null hypothesis, the algorithm displaying the better performance on a given assay should be determined by a coin flip. Therefore, given N assays, we expect each classifier to win on approximately N/2 assays. In our illustrative example, if both \(M_{\text {shallow}}\) and \(M_{\text {deep}}\) were tested on 100 different datasets, each method would be expected to “win”, i.e. outperform the other method, on approximately 50 of the datasets. Obviously, some variability is expected, and it can be quantified using the binomial distribution.
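In practice, the sign test reduces to a binomial test on the number of wins, as in the sketch below (the win count is a hypothetical placeholder; scipy version 1.7 or later is assumed for binomtest):

```python
# Minimal sketch of the sign test as a binomial test on the number of "wins".
# The win count is hypothetical; requires scipy >= 1.7 for binomtest.
from scipy.stats import binomtest

n_datasets = 100
wins_deep = 58  # hypothetical number of datasets on which M_deep beats M_shallow

result = binomtest(wins_deep, n_datasets, p=0.5, alternative="greater")
print(result.pvalue)  # a small p-value would argue against the coin-flip null
```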
There are, of course, still problems with the sign test. First, the test still discards most of the uncertainty information. Second, the test still gives the two assays in Table 1 and Fig. 1 equal weight, which is better than the rank-based weighting but still suboptimal. Additionally, due to the lack of parametric assumptions, the sign test has low power, meaning that it often fails to detect a statistically significant difference between algorithms when one exists.
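The power issue can be illustrated with a small simulation under assumed (not fitted) numbers: paired AUC values are drawn with a modest true improvement, and the rejection rates of the sign test and a paired t-test are compared. In such settings the sign test typically rejects less often.

```python
# Rough power simulation under assumed numbers: sign test versus paired t-test
# on simulated paired AUC values with a small true improvement.
import numpy as np
from scipy.stats import binomtest, ttest_rel

rng = np.random.default_rng(1)
n_assays, n_sims, alpha = 30, 2000, 0.05
reject_sign = reject_t = 0

for _ in range(n_sims):
    auc_a = rng.normal(0.85, 0.05, size=n_assays)           # method 1
    auc_b = auc_a + rng.normal(0.01, 0.03, size=n_assays)   # method 2, slightly better on average
    wins = int(np.sum(auc_b > auc_a))
    if binomtest(wins, n_assays, 0.5, alternative="greater").pvalue < alpha:
        reject_sign += 1
    if ttest_rel(auc_b, auc_a, alternative="greater").pvalue < alpha:
        reject_t += 1

print("sign test power:", reject_sign / n_sims)
print("paired t-test power:", reject_t / n_sims)
```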
Using the sign test, we calculated 95% Wilson score intervals for the proportion of wins under the alternative hypothesis that FNN has better AUC–ROC performance than SVM, the second-best performing classifier according to Mayr et al. Using all 3930 test folds in the analysis (since each is indeed an independent test set) gives an interval of (0.502, 0.534), while comparing only the mean AUC values per assay gives a confidence interval of (0.501, 0.564). While both of these tests are narrowly significant at the \(\alpha =0.05\) level (the intervals do not include 0.5), it is worth examining the practical meaning of these results.
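The Wilson score interval used here is straightforward to compute directly, as in the sketch below; the win count shown is an illustrative value chosen to give a proportion a little above one half, not the exact count from our reanalysis.

```python
# Sketch of a 95% Wilson score interval for a proportion of wins.
# The win count is illustrative, not the exact count from the reanalysis.
import math
from scipy.stats import norm

def wilson_interval(wins, n, alpha=0.05):
    """Wilson score interval for a binomial proportion; returns (lower, upper)."""
    z = norm.ppf(1 - alpha / 2)
    p_hat = wins / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

print(wilson_interval(2035, 3930))  # a win proportion a little above one half over 3930 folds
```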
According to this statistic, our data are compatible with an FNN classifier beating an SVM classifier on 50% to 56% of the assays. Thus, if one were to conclude that only an FNN classifier is worth trying, the user would be failing to use a better classifier almost 50% of the time! And this is for a two-classifier comparison. Considering all the classifiers, FNN and SVM each perform best on 24% of the assays, while every other classifier considered by Mayr et al. is the best performing classifier on at least 5% of the assays (Table 3 shows a breakdown of wins). Clearly, some of these results are just noise due to small assay sizes; however, this indicates that classifier performance is likely assay dependent, and one should try multiple classifiers for a given problem. (It is also noteworthy that the dataset comprises many assay types, e.g. enzyme inhibition/binding affinity/efficacy, which are qualitatively different.)
Unfortunately, though the sign test may be an improvement, the act of averaging results over many heterogeneous assays still fails to properly quantify the applicability and robustness of each model. Reporting merely the average performance obscures the success of each method on assays with differing proportions of actives and inactives, similarities among molecules, and levels of noise.
Table 3 The percentage of test folds, across all assays, on which each method is the best performing method

The above considerations are illustrated by the data in Figs. 2 and 3. Figure 2 shows that the FNN and SVM performances are almost identical for large datasets, but the difference between the performances varies quite erratically for assays with fewer compounds (the smaller points in the figure). Additionally, Fig. 3 shows the best performing algorithm for each independent test fold; we also plot the other algorithms that Mayr et al. considered, namely random forest (RF), k-nearest neighbours (kNN), graph convolutional neural networks (GC), Weave, and long short-term memory networks with SMILES input (LSTM). As one can see, the results are quite varied for smaller assays, and the best performing algorithm is largely dataset dependent. Much of this variation is due to the threefold CV procedure of Mayr et al., which is quite susceptible to large fluctuations when the dataset is small.
However, as the training size increases, the deep learning and SVM algorithms dominate. Interestingly, among all datasets with greater than 1000 compounds in the test set, SVM performance is better than FNN performance on 62.5% of assays, which runs counter to the usual wisdom that deep learning approaches beat SVMs on large assays. Notably, GC, LSTM, and Weave show the best performance on only a small number of large assays, casting doubt on their utility over a standard FNN or SVM. With all of these observations, it should be noted that the results could be due to sub-optimal hyperparameter optimization, and perhaps some of these models can achieve state-of-the-art performance in the hands of expert users. However, hyperparameter optimization can take a considerable amount of time and computing resources.
Additionally, the correlation between the mean AUC–ROC performances of all models is shown in Fig. 4 for the 177 assays with more than 1000 test set samples on average over the three folds. As can be seen, most of the deep learning models perform quite similarly, with NB, kNN, and Weave appearing to show the worst performance. A figure showing how well the models correlate on assays of all sizes is included in the supporting information. Unfortunately, it is difficult to make inferences regarding relative performance on small datasets due to the inherent noise of the datasets and the threefold CV procedure.
Taking all of the above into consideration, it appears that the FNN and SVM models are the best performing models, especially in the case of large datasets. In small datasets, NB, kNN, and RF can often still perform competitively. It is also unclear how well the frequently used gradient-boosted decision tree algorithm would compare, since it was not included in the study. The Mayr et al. data contains quite a lot of information and we provide it and our code online for all who wish to further analyze it.