Confidence limits, error bars and method comparison in molecular modeling. Part 1: The calculation of confidence intervals
Abstract
Computational chemistry is a largely empirical field that makes predictions with substantial uncertainty. And yet the use of standard statistical methods to quantify this uncertainty is often absent from published reports. This article covers the basics of confidence interval estimation for molecular modeling using classical statistics. Alternate approaches such as nonparametric statistics and bootstrapping are discussed.
Keywords
Statistics, AUC, Virtual screening enrichment, Correlation coefficients, Linear regression, Error bars, Confidence intervals
Introduction: Error bars
When we report a number what do we mean by it? Clearly our intention is to convey information: after an experiment we think a property has a certain value; after this calculation our prediction of quantity X is Y. In reality, we know whatever number we report is only an estimate. For instance, we repeat an experiment to measure a partition coefficient between water and octanol five times and get an average, or we apply a computer model to a set of ten test systems and calculate a mean performance. In the former case, will a sixth measurement produce a similar number? In the latter case, do we know if the program will perform as well over a new test set? In other words, how do we know if these numbers are useful?
In statistics utility is about being able to say something concerning the population from a sample. Here population means “everything”, e.g. it could mean all members of a set, or all (infinite) repeats of an experiment. When we test predictive software we hope the average over a set of systems represents what we might get from the population of all possible test systems, including ones not yet imagined. For a physical property measurement we assume our experiments sample the possible range of small variations in conditions, what we call ‘random variables’, in an even and comprehensive way such that the ‘population’ of all such experiments is represented. In either case we know that we have only sampled, not enumerated all possibilities. As such there is an uncertainty in our number. In fact, without an assessment of this uncertainty, or a description of how to estimate it, what we have really delivered is a report, not a prediction; “we did X, followed by Y, and got Z”. In a completely general sense, i.e. from information theory, it can be shown that without at least some estimate of uncertainty a single value technically has no information—essentially because it is represented by a delta function in the probability distribution of possible values, which has a vanishing overlap with the actual distribution of values of the population. In reality a lone number does have some usefulness because we assign it a default sense of certainty from our experience. However such a sense can often be misleading, for instance our default may be wildly optimistic! A rigorous way of incorporating such prior knowledge is the Bayesian framework. Although Bayes approaches are very powerful and general, they lie outside the scope of this article. Interested readers should consider such excellent works as [1, 2, 3, 4, 5].
The Gaussian, or normal, distribution has the form p(x) = (1/σ√(2π)) exp(−(x − μ)^{2}/2σ^{2}). Here μ is the center of the function, our best guess at the average value, and σ is related to the width of the function, our uncertainty (a smaller σ means a narrower Gaussian, a larger σ a wider one). We only need to know σ to state what fraction falls within a given range of μ. The ubiquity of this description is a consequence of the famous “Central Limit Theorem” (CLT). The CLT says that if one samples from some distribution, no matter what that distribution looks like, the distribution of the average of that sample can be expected to look more and more like a Gaussian as the number of samples grows. This does not mean that the ‘true’ value is asymptotically approached; there might be an experimental bias away from the actual value. What the CLT tells us about is the reproducibility of the experimental setup we are using, i.e. it is concerned with precision, not accuracy.
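The CLT is easy to see numerically. The following sketch (in Python; the exponential distribution and the sample sizes are arbitrary choices for illustration) draws sample means from a strongly skewed distribution and shows their spread shrinking as the square root of the sample size:

```python
import random
import statistics

random.seed(1)

def mean_of_sample(n):
    # Mean of n draws from a highly skewed (exponential) distribution.
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# Distribution of the sample mean for two sample sizes.
means_small = [mean_of_sample(4) for _ in range(4000)]
means_large = [mean_of_sample(400) for _ in range(4000)]

sd_small = statistics.stdev(means_small)
sd_large = statistics.stdev(means_large)

# A 100-fold increase in sample size narrows the distribution of the
# mean by about a factor of 10, i.e. the square root of 100.
print(sd_small / sd_large)
```

A histogram of `means_large` would also look far more symmetric and Gaussian than the underlying exponential, which is the other half of the CLT's claim.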
The above description is typically taught in introductory classes to the scientific method, and experimentalists rarely forget it because reproducibility is their core concept. The same cannot be said for theoreticians. The presentation of error estimates, whether reasoned or derived, is rare in the field of computational chemistry. Perhaps this is because of the mistaken belief in the exactness of theory. Evidence for this would be a discernible inverse correlation between the level of theory in publications and the sophistication of any accompanying statistics. Or perhaps it is because practitioners see only a single number from a calculation that can be reproduced by rerunning the program. Of course, this belies the fact that small changes in either inputs to the program, programming presets or even the computer architecture or operating system can lead to great variability [6]. A third possibility is simply that knowledge is lacking. This author, for example, realized some years ago that he had only a very rudimentary knowledge of statistical assessment. It is to the latter possibility that this paper is addressed, to present, in the context of molecular modeling, how basic error bar evaluation and comparison should be done.
This goal turned out to be a considerable undertaking. The field of molecular modeling is diverse and complex, incorporating many levels of theory and physical approximation. Attempting to cover all eventualities is beyond the scope of a journal article. However, there are many common tasks and principles that would be of use if more widely known, and this paper attempts to organize and present such a collection. In order to keep even that goal within reasonable bounds the statistics introduced here will be largely what is often referred to as “classical”. By classical we mean it is Frequentist and “parametric”. The term Frequentist refers to the school of statistics developed by Fisher, Pearson, Gossett, Neyman and others during the late 19th century and the first half of the 20th century. It is based on the concept of reproducibility, of being able to imagine repeating events or experiments an arbitrary number of times. As such, quantities calculated from Frequentist approaches are “asymptotic”, by which is meant that the key question is often just how many repetitions are necessary to give a certain level of confidence in a prediction. The Frequentist approach is often contrasted to the Bayesian approach, which differs in its focus on prior knowledge: what did we know, and how does that help us make sense of what we observe next? The advantage of the Bayesian approach is that it adapts more easily to the real world, e.g. some experiments we really cannot rerun. However, the Bayes formalism often requires numerical simulation and, in fact, really became popular only once computing power was widely available. The advantage of the Frequentist approach is a wealth of carefully constructed formulae that can be used to address all kinds of problems.
The availability of many of these formulae is due to the second part of the description of the work presented here, i.e. that the statistics are “parametric”. This term is often made synonymous with statistics that assume a Gaussian distribution of random variables, although more properly it applies to any approach where a functional form has been assumed for the distribution of some quantity, a functional form controlled by some “parameters”. In the case of Gaussians these are the center and the width, but there are other functional forms, e.g. Binomial, Poisson, Laplacian, Cauchy etc., with their own characteristics. Nonparametric, classical statistics do not make assumptions about the form of distributions and, as such, are more general. A few will be mentioned in this article. However, the focus will be on classical, Gaussian-based statistics. The first reason is that classical statistics usually give a simple way to rapidly assess likely error and how this error decreases with sample size. More than other approaches, they can be “aids to thinking”, rather than magic boxes producing numbers. The second reason is to keep this report of manageable length.
Even with these decisions it has been necessary to split this paper into two parts, corresponding to the two definitive uses of confidence limits: comparison to fixed values, for instance the height of a levee, and comparison to other confidence limits, such as comparing prediction methods. Both uses are valuable; if your company gets a milestone payment if it identifies a one-nanomolar lead compound, then the accuracy of the measurement of that affinity is of some importance. If you are comparing two (or more) models of activity you will waste a lot of time and resources if you cannot tell which is more accurate. As such, the comparison of properties with associated confidence intervals will be described in a subsequent paper, with the focus here on the estimation of a single error bar.
1. Basic principles
   a. Standard terms
   b. The Gaussian distribution
   c. One- or two-tailed significance
   d. Long tails
   e. The test statistic, t
   f. The origin of the square root in asymptotic error
   g. Reporting data, box plots
   h. The error in the error
2. Small sample size effects
   a. The Student t distribution
   b. p values and the Student test statistic
3. Useful analytic forms for error bars in modeling
   a. Probabilities
   b. Area under the (ROC) curve (AUC)
   c. Virtual screening enrichment
   d. Linear regression (straight-line fit) properties
   e. Pearson’s correlation coefficient, r
4. Asymmetric error bars
   a. Probabilities
   b. Area under the (ROC) curve (AUC)
   c. Virtual screening enrichment
   d. RMSE
5. Combining errors from different sources
   a. General formulae and examples
   b. The general error formula
   c. Estimating unknown contributions to error
   d. Assessing adding noisy systems to test sets
   e. Variance-weighted averages with examples
   f. Weighted averages of variances
6. Bootstrapping error bars
   a. Introduction
   b. Limitations
   c. Advantages
Basic principles
Standard terms
As such, there are really only (N − 1) independent variables in the equation for s_{N}. This explanation always seemed mysterious to this author! Accordingly, “Appendix 1” includes a simple proof that using (N − 1) gives an estimate of the SD that is unbiased, i.e. in the limit of large sample sizes the estimate is equally likely to be slightly larger or slightly smaller than the population value. In many cases it is not obvious how many degrees of freedom there are; sometimes approximations are employed that give fractional degrees!
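The effect of the (N − 1) denominator is easy to check numerically. This sketch (the sample size and repeat count are arbitrary choices) shows that dividing the sum of squared deviations by N systematically underestimates the population variance, while (N − 1) does not:

```python
import random
import statistics

random.seed(2)

N = 5
# Sample repeatedly from a standard normal (true variance = 1.0).
biased, unbiased = [], []
for _ in range(40000):
    xs = [random.gauss(0.0, 1.0) for _ in range(N)]
    m = sum(xs) / N
    ss = sum((x - m) ** 2 for x in xs)
    biased.append(ss / N)          # divide by N: biased low by (N - 1)/N
    unbiased.append(ss / (N - 1))  # Bessel's correction: unbiased

print(statistics.fmean(biased))    # ~0.8, i.e. (N - 1)/N of the truth
print(statistics.fmean(unbiased))  # ~1.0
```

The bias arises because deviations are measured from the sample mean, which is itself fitted to the data, using up one degree of freedom.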
The Gaussian distribution
One- or two-tailed significance
An important distinction needs to be made here as to the “sidedness” of areas under a Gaussian. The 95 % confidence limits that are usually quoted refer to the possibility of a value being larger or smaller than a given range. But suppose we are interested only in whether a value is larger than a given value? For that we do not have to consider the lower range—our interest is “one-tailed”, i.e. it only concerns one tail of the distribution function. For instance, in the above example, there is only a 2.5 % chance the actual value is greater than 0.7. One-sided comparisons go with questions such as, “Is quantity A greater than a value X”, or “Is quantity A less than a value Y”, but not both (in which case the two-tailed distribution is required). A focus of classical statistics, possibly to its detriment, is whether two things are different. In such tests we are agnostic as to which might be better and which worse, only whether they are distinguishable. As such, the 95 % range and its association with roughly two standard deviations from the mean is appropriate. However, when we are asking a more specific question: is drug A worse than drug B, is treatment C better than no treatment, we are asking a one-tailed question. As this issue is more germane to the comparison of quantities, i.e. to confidence limits on differences of properties, further discussion will be postponed until the second paper of this series.
Long tails
Much of the criticism of classical statistics concerns the tails of the Gaussian distribution not being accurate. For example, Taleb and others [10, 11] have pointed out that the distribution of returns on stock investments is Gaussian (i.e. random) for short time intervals but that rare events (“Black Swans”) appear much more frequently than expected. Taleb co-founded an investment vehicle, “Empirica Capital”, based on this principle, i.e. designed to lose money when the stock market was behaving in a regular, “Gaussian” manner, and yet to win big when it deviated from this behavior. There is considerable work in the area of unlikely events and their distributions, so-called “extreme-value” distributions such as the Gumbel, Fréchet or Weibull distributions [12]. In addition, we will consider the most widely applied “long-tailed” function, the Student t-function, shortly.
Test statistic, t
The origin of the square root in asymptotic error
The fact that the error in an average goes down with the square root of the number of observations was not always appreciated. Examples of it not being known can be dated back to the Trial of the Pyx, 1282 AD [13]. The Trial was a test designed by the English Royal Mint to check for unlawful deviations in the weight of the King’s coinage, and it was assumed such variation would be linear with the number of coins tested. Thus an unnecessarily large tolerance was assumed, allowing the unscrupulous but mathematically astute to ‘game’ the King’s system, at some peril naturally! Even today it is at the root of many misunderstandings of published data [14].
Reporting data, box plots
Although 95 % is a standard for a confidence interval, there are variations worth knowing. The first is that “1.96” is often simply replaced with “2.0”, a ‘two-sigma’ error bar, since this makes very little difference (95.4 % compared to 95.0 %). However, this leads to error bars that are based on the number of sigmas, not a percentage. So, for instance, a one-sigma error bar (not uncommon) contains 68.2 %, roughly two-thirds, of expected outcomes; a three-sigma error bar (unusual) contains about 99.7 %. It is important a researcher is clear as to which is being presented—especially if the smaller one-sigma error bars are reported. No report is complete unless the meaning of presented error bars is explicitly recorded.
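The percentages quoted above follow directly from the Gaussian integral: the fraction within k standard deviations of the mean is erf(k/√2), as a short sketch confirms:

```python
from math import erf, sqrt

def gaussian_coverage(k):
    # Fraction of a Gaussian lying within k standard deviations
    # of its mean: erf(k / sqrt(2)).
    return erf(k / sqrt(2.0))

for k in (1.0, 1.96, 2.0, 3.0):
    print(f"{k}-sigma: {100 * gaussian_coverage(k):.1f} %")
# 1-sigma ~68.3 %, 1.96-sigma ~95.0 %, 2-sigma ~95.4 %, 3-sigma ~99.7 %
```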
Why are box plots nonparametric, i.e. why the ‘median’ and “quartiles”, rather than the mean or SD? The answer lies in Tukey’s interest in “Robust Statistics” [15]; a field he helped to create. Robust statistics attempts to address the problems that outliers can cause to traditional, parametric, Gaussianbased statistics. For instance, a single outlier can shift the mean (or SD) of a set of measurements an arbitrary amount; the mean (or SD) is ‘fragile’ with respect to a single measurement. Contrast this to the median (or quartile) where adding a single value, no matter how extreme, can move the median (quartile) only to an adjacent measurement. Ripley nicely describes Tukey’s investigations and those that followed in robust statistics [16].
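The fragility of the mean versus the robustness of the median is easy to demonstrate; in this small sketch the measurement values are invented for illustration:

```python
import statistics

data = [4.1, 4.3, 3.9, 4.0, 4.2]

# Replace one measurement with a gross outlier.
corrupted = data[:-1] + [400.0]

print(statistics.mean(data), statistics.mean(corrupted))      # 4.1 vs ~83
print(statistics.median(data), statistics.median(corrupted))  # 4.1 vs 4.1
```

A single corrupted value drags the mean arbitrarily far, while the median simply shifts to an adjacent measurement, exactly as described above.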
The error in the error
Small samples
The Student t distribution
p values and the Student test statistic
Table showing the 95 % Student t test statistic for different numbers of data points, N

  N (= ν + 1)    t_{95 %}
  2              12.71
  3               4.30
  4               3.18
  5               2.78
  10              2.26
  20              2.09
  50              2.01
  100             1.98
  ∞               1.96
As can be seen, you need about twenty samples before the inaccuracy in the prefactor of 1.96 is less than 10 %. Consider the case where a standard deviation is being estimated from a measurement done in triplicate—the t statistic is more than twice what one would expect for large sample sizes!
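The tabulated values can be reproduced with any statistics library; for example, assuming SciPy is available:

```python
from scipy.stats import t

def t95(n_points):
    # Two-tailed 95 % critical value of the Student t distribution
    # for N data points, i.e. nu = N - 1 degrees of freedom.
    return t.ppf(0.975, df=n_points - 1)

for n in (2, 3, 5, 20, 100):
    print(n, round(t95(n), 2))
# Reproduces the table: 12.71, 4.30, 2.78, 2.09, 1.98
```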
The ramifications of Student’s work were slow to materialize but were eventually recognized as fundamental to practical statistics in industrial settings. It also illustrates one of the issues with “classical” statistics, i.e. its reliance on lookup tables. That is inevitable because the functions that describe the different probability distributions are not common outside of statistics. These days, however, it is easy to find such tables online.
One further aspect of the Student t that has only become appreciated in recent years is that it can also be used to improve the robustness of linear regression [18]. This is because the long tails of the Student function better tolerate outliers, i.e. the “unlikelihood” of an outlier does not outweigh the “likelihood” of “inliers” as much.
Useful analytic forms for error bars in modeling
We present here some known and some new results for error bars of typical measures of importance in computational chemistry, namely, (1) the probability of an event, (2) the area under the curve (AUC) for receiver operating characteristic (ROC) plots, (3) virtual screening enrichment, (4) Pearson’s r^{2} and (5) linear regression. Analytic results may seem old-fashioned when modern computing power can simulate distributions, e.g. via bootstrapping, but they can be invaluable when the primary data is not available. In addition, they allow us to think about the contributions to the error terms in a way that simulations do not. Finally, as will be discussed below, there are occasions when the prior knowledge they represent can be helpful in producing more robust estimates.
Probabilities
Area under the (ROC) curve (AUC)
The ROC AUC is equivalent to the probability a randomly chosen true event is ranked higher than a randomly chosen false one, i.e. the higher the AUC the greater the ability of the property to distinguish true from false. In what follows ‘true’ will mean active, e.g. an active molecule, a correctly docked molecule etc., whereas ‘false’ will mean an inactive molecule, an incorrectly docked molecule etc. A subtlety arises as to how ties are managed. The simplest prescription, followed here, is to count a tie as one half of its normal contribution.
The expected accuracy of the AUC will depend on the number of actives and the number of inactives. Consider each active in turn. It contributes to the AUC by the fraction of inactives for which it ranks higher. Since this contribution is a probability, accuracy of this property will depend on the number of inactives. We then combine the probability of this active with the similar probability for all other actives. This average of probabilities will have its own distribution, the tightness of which will depend on the square root of the number of actives. Thus there are two sources of error. In a later section we shall more generally consider the situation of multiple contributions to error but this is an example of just such, i.e. error from the finite number of actives and from the finite number of inactives.
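As a sketch of how these two sources enter, the following computes the AUC by direct pair counting (ties as one half) together with the Hanley–McNeil standard-error estimate; the score lists are invented for illustration:

```python
from math import sqrt

def auc_and_hanley_se(actives, inactives):
    """AUC as the probability a randomly chosen active outscores a
    randomly chosen inactive (ties count one half), with the
    Hanley-McNeil estimate of its standard error."""
    na, ni = len(actives), len(inactives)
    wins = sum(1.0 if a > d else 0.5 if a == d else 0.0
               for a in actives for d in inactives)
    auc = wins / (na * ni)
    # Hanley-McNeil approximations for the pair-correlation terms.
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (na - 1) * (q1 - auc * auc)
           + (ni - 1) * (q2 - auc * auc)) / (na * ni)
    return auc, sqrt(var)

auc, se = auc_and_hanley_se([0.9, 0.8, 0.7, 0.4],
                            [0.6, 0.5, 0.3, 0.2, 0.1])
print(auc, se)  # 0.9 and roughly 0.12
```

Note how both na and ni appear in the denominator of the variance, reflecting the two sources of error discussed above.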
It is worth noting that this is not the first time the accuracy of the Hanley approach has been examined. Cortez et al. [24] compared their exhaustive enumeration of ROC curves with the Hanley result and that from DeLong’s formula and found any improvement was marginal.
Virtual screening enrichment
A common complaint against the use of the AUC curve is that it does not measure the quantity of interest, i.e. the segregation of actives to the very top of the list, the ‘early’ enrichment. Such claims are misinformed, as the AUC is a reliable estimate of early performance; in fact, averaged over many systems it is a better estimate of early enrichment than artificial measures that have been ‘designed’ to reflect this quantity [27]. This is because the AUC uses all the data and so it is more statistically reliable (√N is larger). For instance, it can be shown that the AUC is more robust to the inclusion of “false false positives”, i.e. compounds that are assumed inactive but are actually active [27]. The second reason is that although a single AUC value may mislead as to early performance, e.g. the ROC curve might have a sigmoidal shape where the early enrichment is poor but some less relevant middle enrichment makes the AUC look good, averaged sets of ROC curves tend to look very ‘canonical’, i.e. have Hanley shapes [27]. Such averages of AUC correlate very well with measures of early enrichment, but with much better statistical properties.
Despite the above observation, the field is attracted to measures of early enrichment, typically defined as the ratio of the percent actives recovered when a given percent of the database has been screened to the expected percent of actives if they were indistinguishable from inactives. For instance, if 10 % of the database has been screened and 20 % of all actives have been found then the enrichment is 2.0. This deceptively simple formula has a flaw that makes it undesirable as a metric—it depends on the ratio of inactives to actives [28]. It makes little sense to choose a metric that depends on an arbitrary, extrinsic aspect of the system, e.g. the relative numbers of active and decoys. Metrics should be intrinsic, e.g. how well does this docking program work, not how well does this docking program work given this ratio of actives to inactives—something that will clearly not be known in advance in a prospective application.
Note this saturation effect has nothing to do with the total number of actives and inactives, just their ratio, and it clearly gets worse at smaller enrichment percentages. At 1 % enrichment you need R > 99, for 0.1 %, R > 999 and so on. And, of course, this saturation effect is noticed before the enrichment limit is reached. One approach would be to simply make sure inactives are always in great excess. Better, though, is to redefine the enrichment as the fraction of actives found when a given fraction of inactives has been found. This metric, which we will call the ROC enrichment [28], is essentially the ratio of Y to X of a point on the ROC curve. It is independent of R and is an intrinsic property of the method. It also allows for straightforward calculation of the expected error because both the fraction of actives and the fraction of inactives can be looked upon as probabilities, for which we can calculate variances.

A = total number of actives
I = total number of inactives
f = fraction of inactives observed at a threshold T
g = fraction of actives observed at the same threshold
e = ROC enrichment
S = dg/df = slope of the ROC curve at point (f, g)
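Using these definitions, a minimal sketch of computing a ROC enrichment from raw scores (assuming higher scores are better, that f corresponds to a whole number of inactives, and ignoring the subtlety of tied scores):

```python
def roc_enrichment(actives, inactives, f):
    """ROC enrichment e = g/f: the fraction g of actives recovered when
    a fraction f of the inactives has been passed, divided by f."""
    ranked = sorted(inactives, reverse=True)
    n_seen = round(f * len(inactives))   # inactives passed at the threshold
    threshold = ranked[n_seen - 1]       # score of the last inactive passed
    g = sum(1 for a in actives if a > threshold) / len(actives)
    return g / f

e = roc_enrichment([0.95, 0.9, 0.8, 0.3],
                   [0.85, 0.7, 0.6, 0.5, 0.45, 0.4, 0.35, 0.25, 0.2, 0.1],
                   f=0.2)
print(e)  # 3 of 4 actives recovered at 20 % of inactives: e = 0.75/0.2
```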
Linear regression (straightline fit) properties
Although there are obvious drawbacks in assuming a linear relationship between predictor and predicted, it can also make a lot of sense. It is often the simplest model beyond the average of a set of experimental values (a “null” model that itself ought to be applied more often). In addition, although we deal with complex systems, we often assume that while one variable cannot explain an effect entirely, everything left out in the explanation might be proportional to what is left in, i.e. that our key variable merely needs to be scaled. Examples of this are simplified molecular polarization models wherein the induced field is assumed linearly proportional to the field from the static molecular charges, i.e. the assumption is made that subsequent induction caused by this ‘first order’ induction can be captured by a scaling factor. Methods such as Generalized Born [29] use this ansatz. The scaling of charges to mimic polarization in force fields is a similar example (it is presumed polarization energies are proportional to the increase in Coulombic interaction between molecules with scaled dipole moments). The approach is widely used in other sciences; for example in simulating the properties of stellar bodies it is sometimes easier to model the electrodynamics than the magnetohydrodynamics [30]. Similar ansätze occur in nuclear physics (e.g. the Bethe–Weizsäcker formula of the liquid drop model), quantum mechanics (e.g. functional construction in Density Functional Theory), statistical mechanics (e.g. liquid theory virial expansion cutoffs) and solid-state physics (e.g. effective interaction potentials of quasiparticles).
Even though models are not necessarily linear, it is typical that linear regression quality is often used as a measure of model quality. Given the widespread use of linear regression, it is surprising that straightforward estimates of the likely errors in the slope and intercept are seldom published.
 (i)
At the center of the data range the expected error in y is the average error divided by √N, as if all N points were variations about the mean x value.
 (ii)
Away from the center the error is magnified by a hyperbolic term
 (iii)
This magnification is scaled by the inverse of the range of the data.
A real-world example of this occurs in the estimation of vacuum-water transfer energies. These estimates are very useful in testing theories of solvation, but the direct measurement of these energies is difficult. Indirectly one can use the combination of vapor pressure and solubility, whether from the liquid or solid form, i.e. the transfer from vacuum to water can be thought of as a two-stage process: (1) from vapor to the solid or liquid form (minus the vapor pressure), then (2) from the solid or liquid form to the solvated form (solubility).
Typically this equation is used to extrapolate the vapor pressure P to a temperature of interest. As such, errors in the slope and the intercept can both play a role in the estimation of the vapor pressure that goes into estimation of the solvation energy [31]. de Levie [32] has a similar example for the estimation of room temperature rate constants, along with a more in-depth analysis of this common problem.
Pearson’s correlation coefficient, r
1. Calculate r.
2. Calculate the value of F(r).
3. Add and subtract t-statistic/√(N − 3) to obtain confidence limits for F.
4. Back-transform these confidence limits into r-values. These are the r-value confidence limits, and will be asymmetric.
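The four steps above can be sketched directly (the large-sample value 1.96 is assumed for the t-statistic, and the example r and N are invented; F(r) is Fisher’s z-transform, atanh):

```python
from math import atanh, tanh, sqrt

def pearson_r_ci(r, n, t95=1.96):
    """95 % confidence limits on Pearson's r via Fisher's transform
    F(r) = atanh(r), whose standard error is ~ 1/sqrt(N - 3)."""
    f = atanh(r)                        # steps 1-2: transform r
    half_width = t95 / sqrt(n - 3)      # step 3: limits in F-space
    return tanh(f - half_width), tanh(f + half_width)  # step 4: back

lo, hi = pearson_r_ci(0.7, 50)
print(lo, hi)  # roughly (0.52, 0.82); note the asymmetry about 0.7
```

The lower limit sits further from r than the upper one, which is exactly the asymmetry step 4 predicts.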
Pearson’s r is central to many disciplines; it has become the sine qua non of “discovery”, i.e. is there an effect or not. As such it is important to understand its limitations, for instance the expected error, the sensitivity to outliers, and what it tells us about the underlying causes of the correlation. This topic will be explored further in the follow-on article in which we consider the comparison of r-values.
If the distributions of errors are roughly Gaussian, and the relationship is linear, then there are formulae that can interconvert between r and τ and ρ and provide error estimates for each [35]. Notably, simply dividing the (N − 3) term in equations dealing with the confidence intervals for r by the constant 1.06 gives equivalent significance values for ρ. Also, τ and ρ are much more robust (outliers can rearrange at most 1/N of the total number of rank comparisons) and they can find any monotonic relationship, not just linear correspondences.
There also exist a wide range of “pseudo” r-squared quantities that can be used for categorical variables, such as McFadden’s, Efron’s, Cox and Snell’s, and the Nagelkerke or Cragg and Uhler’s [36]. These serve as analogs of Pearson’s r but for logistic regression, i.e. when what we want to predict is essentially binary, e.g. active or inactive. The process of logistic regression is appropriate to many aspects of computational chemistry; however, there are few applicable insights into its error analysis from classical statistics and so it falls outside the scope of this article.
Similarly, there is a variant of McFadden’s pseudo r-squared that penalizes parameters. Such variants are intended for model comparison, not for estimating the quality of a correlation. Furthermore, there are reasons to prefer other tests for comparing parameterized models, such as Fisher’s F-test [38], or tests that include parameter penalties from information theory, e.g. Akaike’s Information Criterion (AIC) or Schwarz’s Bayesian Information Criterion (BIC) [39, 40].
Asymmetric error bars
As we saw for Pearson’s r, error bars can be asymmetric when there are fundamental bounds to the confidence limits. The way forward in such cases is to transform to a variable that is unlimited, and hopefully with an error distribution that is more symmetric and Gaussian. We then calculate error bars for this new variable, and finish by transforming these error limits back to the original variable. In this section this process is examined for the quantities of interest considered above.
Probabilities
1. Calculate the standard error of the input p, SE(p), i.e. √(p(1 − p)/N).
2. Calculate f(p).
3. Multiply SE(p) by (df/dp) to get the standard error SE(f).
4. Calculate f ± t_{95 %} * SE(f).
5. Back-transform these two values to obtain the confidence interval in p.
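The five steps can be sketched as follows, assuming the transform is the logit, f(p) = ln(p/(1 − p)), as the later reference to the logit derivative suggests; the large-sample t of 1.96 and the example p and N are assumptions for illustration:

```python
from math import log, exp, sqrt

def probability_ci(p, n, t95=1.96):
    """95 % CI for a probability via the logit transform."""
    se_p = sqrt(p * (1.0 - p) / n)      # 1. standard error of p
    f = log(p / (1.0 - p))              # 2. transform f(p)
    se_f = se_p / (p * (1.0 - p))       # 3. SE(f) = SE(p) * df/dp
    lo_f = f - t95 * se_f               # 4. limits in f-space
    hi_f = f + t95 * se_f
    expit = lambda x: 1.0 / (1.0 + exp(-x))
    return expit(lo_f), expit(hi_f)     # 5. back-transform to p

lo, hi = probability_ci(0.9, 30)
print(lo, hi)  # asymmetric about 0.9, and never outside (0, 1)
```

The back-transformed interval is automatically confined to (0, 1), which a naive symmetric interval p ± 1.96·SE(p) would not be.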
Area under the (ROC) curve (AUC)
As stated previously, the AUC of a ROC curve can be interpreted as the probability that a randomly chosen active has a higher score than a randomly chosen inactive. As such, we simply follow the same multistep procedure as above for probabilities, substituting the formula for the standard error of the AUC, i.e. Eq. 25c, for the standard error of the probability. There is no analytic formula as neat as that for a probability, but the procedure is straightforward: transform, calculate the transformed limits, and then back-transform to obtain the limits in terms of a probability.
Virtual screening enrichment
As described above, one can define an enrichment quantity that is bounded by zero and one, i.e. the ROC Enrichment scaled by the percent of inactives. This can also be treated as a probability; it is the probability that an active is seen before a given percent of inactives. As such, this quantity can be treated by the standard procedure, i.e. transform the scaled measure, scale the variance using the derivative of the logit function, calculate the confidence limits in the transformed space, back transform and finally scale to a ROC Enrichment by dividing by the percent of inactives. The question of the number of effective degrees of freedom follows a similar treatment as with AUC, i.e. the Welch–Satterthwaite formula, Eq. 26, whose elements are the individual variances of actives and inactives and their respective counts. If the number of decoys is much larger than the number of actives then the latter is used as the effective number of degrees of freedom.
If the more traditional definition of enrichment is being used then we should first transform to the equivalent ROC enrichment numbers. This is important because we cannot extract a simple probability from the traditional enrichment value because of saturation. Saturation means that the apparent probability we see, i.e. the probability an active is found in a given percent of the database, is dependent on other factors, such as the ratio of actives to inactives. Transforming to the ROC enrichment gives a pure probability for which confidence limits can be established. These can then be transformed to traditional enrichment numbers. The formulae for these transformations can be found in “Appendix 4”.
RMSE
The χ-squared function has many uses in statistics. It is used in assessing the quality of fit of an observed distribution to a theoretical prediction, as in the classic χ-squared test; in assessing whether classification criteria are independent; in nonparametric tests such as Friedman’s test for distinguishing ranked quantities; and in the derivation of Fisher’s F-test (basically a ratio of two χ-squared functions), which can be used to see if the addition of parameters sufficiently improves a model. Here all we need are ranges for 95 % confidence limits. In the examples from the Basics section we had an RMSE of 2.0 kcal/mol for an affinity prediction based first on fifty samples and then on eight.
Example 1: Fifty samples
Example 2: Eight samples
Combining errors from different sources
General formulae and examples
The general error formula
Take, for example, the error in the pKa estimation of a functional group. This might involve the error in the measurement of the pKa of a model compound, the expected error in moving from the model to the actual compound, the error in the estimation of the influence of an external potential from some distal group, and so on. Each term is introduced, perhaps empirically, as adding its own contribution to the variance of the composite variable, in this case pKa. If the thermodynamics of a process are hard to measure but there are several "canonical" steps in the process, each step will add its own contribution to the total error. For example, solubility can be viewed as a two-step process: sublimation from the crystal form and solvation of the free species into water (or another solvent). Each step has its own energy component and error. Reaction rates often involve several steps, each of which can perhaps be measured to good precision; however, the total rate may be inaccurate due to accumulation of error.
What is important here is to remember that there may be multiple sources of error. If they can be estimated individually they can be combined in an appropriate manner, e.g. as in the general equations above.
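For independent sources, the combination rule above amounts to adding variances ("in quadrature"). A minimal sketch, using the hypothetical pKa example; all numbers are illustrative, not measured values:

```python
# Combining independent error sources: variances add, so the combined
# SD is the square root of the sum of squared SDs. The three example
# contributions below are hypothetical.
import math

def combined_sd(*sds):
    """SD of a sum of independent contributions: variances add."""
    return math.sqrt(sum(s * s for s in sds))

# e.g. model-compound measurement, model-to-compound transfer, and a
# distal-group correction, each with its own assumed uncertainty:
total = combined_sd(0.1, 0.3, 0.2)   # ~0.37 pKa units
```

Note that the largest contribution dominates: halving the 0.1 term barely changes the total, which is why effort is best spent on the noisiest step.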
Estimating unknown contributions to error
As an example, suppose we have a computational technique, say docking, and a test set of structures. We diligently apply our docking program to the test set, recording the range of performance statistics across this set. Now we give out both the test set and the software to the community to perform their own validations. To our surprise the results that are returned are not the same as ours but vary considerably depending on who used the program. For instance, it has been demonstrated that some programs with stochastic components behave quite differently on different platforms [6], or it could be because different program presets were used. Overall, the variance of performance is a reflection of user variability. If we know all other sources of error, for instance the limited number of systems, the limited number of actives and decoys and so on, then the remaining error is simply user variability.
Alternatively, we could look at our own evaluation in more detail. Some variability will arise because of the range of aptitude of the technique for different protein systems and some will arise because of the finite number of actives and decoys used for each system. As we can estimate the expected error for the latter, we can calculate a more accurate estimate of the intrinsic performance across different systems, i.e. we can estimate what our variability would have been if we had an infinite number of actives and decoys.
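The decomposition just described is a simple subtraction of variances. A minimal sketch, with hypothetical numbers:

```python
# Variance decomposition: if the observed per-system spread includes a
# known finite-sampling component, the intrinsic (infinite-data) variance
# is estimated by subtraction. All numbers below are hypothetical.
import math

var_observed = 0.012   # variance of a metric (e.g. AUC) across test systems
var_sampling = 0.004   # expected variance from finite actives/decoys
var_intrinsic = max(var_observed - var_sampling, 0.0)
sd_intrinsic = math.sqrt(var_intrinsic)   # spread with infinite actives/decoys
```

The `max(..., 0.0)` guard matters in practice: with noisy estimates the subtraction can go negative, which simply means the sampling contribution accounts for all the observed spread.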
Assessing adding noisy systems to test sets
Finally, we have assumed that the noise in the system comes from the intrinsic variance of the method, i.e. the variance we would see if we had an infinite number of actives and decoys, plus the expected variance from the systems. There could be other sources of noise; for instance, we mentioned above the potential contribution from different users applying the program. These terms would become part of what we would see as the intrinsic variance.
Varianceweighted averages with examples
One way to clarify when this formula should be used is to ask what would happen if one measurement was exceedingly accurate. In the case of combining different systems this would mean that the value associated with this one system would dominate the average—clearly not what we would want. However, if we are presented with several estimates of a LogP we would clearly be satisfied if one measurement was very accurate and would intuitively use it rather than combine it with other, less accurate values.
Table showing example pKa values for three measurements with associated SD for each experimental measurement

          pKa1    pKa2    pKa3
  Value   4.2     4.4     4.9
  SD      0.2     0.2     0.5
We consider three cases:
Case 1
Case 2
Case 3
The most accurate result is from Case 1, with the correct variance weighting of the data. Adding the third result without taking into account its expected error actually made things worse compared to leaving it out entirely. This illustrates the importance of knowing the accuracy of independent measurements. In addition, it can be used to assess the potential consequences of including a measurement of unknown accuracy.
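The inverse-variance weighting for the pKa table above can be sketched directly: each measurement gets weight 1/SD², and the SD of the weighted mean is 1/√(sum of weights).

```python
# Variance-weighted average of the three pKa measurements from the table
# above (4.2 +/- 0.2, 4.4 +/- 0.2, 4.9 +/- 0.5).
import math

values = [4.2, 4.4, 4.9]
sds    = [0.2, 0.2, 0.5]

weights = [1.0 / s**2 for s in sds]                              # 25, 25, 4
wmean = sum(w * v for w, v in zip(weights, values)) / sum(weights)
wsd   = 1.0 / math.sqrt(sum(weights))

print(round(wmean, 2), round(wsd, 2))   # 4.34 0.14  (weighted)
print(round(sum(values) / 3, 2))        # 4.5        (unweighted)
```

The noisy third value is down-weighted by a factor of about six relative to the first two, so it shifts the weighted mean only slightly; the unweighted mean is pulled much further toward it.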
This leads into the area of outlier removal that is beyond the scope of this article, but the concept is straightforward. Suppose we are suspicious of the third measurement, and of its uncertainty estimate. Using just the first two values we obtain an estimate of 4.30, with an uncertainty of 0.14. This makes the third value of 4.9 appear unlikely unless it has a large variance. Peirce's criterion [44] is to calculate the likelihood of all three observations, compared to the likelihood of two good measurements and one being in error. It is also similar in spirit to the discussion above as to whether adding a new, noisy system to a set improves or degrades the expected accuracy.
The variance-weighted formula is typically used in the field of meta-analysis, i.e. where an effect size is estimated by combining different studies of different inherent accuracy. However, there is no reason it cannot also be applied to problems in computational chemistry when results of widely different provenance and accuracy are involved.
Weighted averages of variances
Bootstrapping error bars
Introduction
In recent years computing power has made it possible to estimate many statistical quantities without the need for analytic formulae. Put simply, the data at hand is resampled "with replacement" and the metric of interest is recalculated each time. The distribution of this set of recalculated numbers is then used to derive statistics of interest. The phrase "with replacement" just means that if you start with N observations, the new "random" set of N observations can contain (almost certainly will contain) repeated instances. E.g. if your dataset is {1, 4, 3, 2, 5} a bootstrapped sample might be {1, 1, 5, 3, 3}, i.e. the same number of data points but drawn randomly from the original. As a rule of thumb, about a quarter of the data points will be used more than once. To get a 95 % confidence limit from bootstrapping you observe the range around the mean that contains 95 % of the resampled quantity of interest. Because our bounds are drawn from the calculated distribution they cannot exceed 'natural' limits, e.g. [0, 1] for probabilities. Neither do we have to worry about small-sample-size effects, or effective degrees of freedom. No mathematics required! Just resample many times (typically thousands) until the statistics you desire seem stable. As computational time is cheap these days, this is feasible for nearly any application in the field of molecular simulation.
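A minimal percentile-bootstrap sketch, using the toy data set from the text: resample with replacement, recompute the statistic, and read the 95 % interval off the sorted resampled values.

```python
# Percentile bootstrap: resample with replacement, recompute the statistic,
# and take the central 95% of the resampled values as the interval.
import random

def bootstrap_ci(data, stat, n_boot=10000, level=0.95, seed=1):
    rng = random.Random(seed)                  # seeded for reproducibility
    vals = sorted(
        stat(rng.choices(data, k=len(data)))   # resample with replacement
        for _ in range(n_boot)
    )
    a = (1.0 - level) / 2.0
    return vals[int(a * n_boot)], vals[int((1.0 - a) * n_boot) - 1]

data = [1, 4, 3, 2, 5]
mean = lambda xs: sum(xs) / len(xs)
lo, hi = bootstrap_ci(data, mean)
# No resampling can push the mean below 1 or above 5, so the interval
# is necessarily contained in [1, 5] -- the 'natural limits' property.
```

The same function works for any statistic (median, AUC, RMSE) by swapping in a different `stat`, which is exactly the appeal of the method.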
It can be tempting to assume that bootstrapping is all that is ever needed, but this is incorrect. An obvious obstacle is if the primary data is not available, or is difficult to extract from its deposited form, e.g. embedded in a PDF file. It is surprising how often journals allow the deposition of hard-to-use data to count as 'submission of data'. One wonders how far structural biology would have got if PDB files were only available as PDF files! With classical statistics a researcher can come up with an on-the-fly estimate of the confidence limits of a proposed finding. Such checks with reality are often useful at scientific meetings.
Having an analytic formula allows a scientist to think about the character of the error—e.g. what the dominant terms will be, how they will behave with respect to size. This should be a natural part of how scientists think about experiments and can get lost if everything is just simulated. When Schrödinger was presented with the result of an early computerderived numerical solution to his famous equation he is supposed to have commented, “I know it (the computer) understands the answer; however, I’d like to understand it too”. Sometimes a result is all you need and at other times an understanding of the result is more important.
Limitations
 (i)
Bootstrapping may be problematic for calculating the mode of a set of observations, i.e. the most common occurrence. As you are introducing multiple copies of observations you can end up measuring how often you oversample a single observation [45].
 (ii)
Calculating the maximum or minimum of a function is also not natural for this procedure. Any resampling that leaves out the maximum or minimum can only underestimate these extrema, i.e. bootstrapping cannot help but average to a lower value than it should. This has relevance to the calculation of enrichment in virtual screening when the percent of inactives screened is so small that you are essentially measuring the extreme values of the scores for the actives.
 (iii)
A significant limitation of bootstrapping is the calculation of correlation coefficients. This makes intuitive sense from the character of a correlation. If we replace one of our data points with a duplicate of another then the correlation is likely (but not guaranteed) to increase, meaning that the average correlation of a sampled population may appear higher than the true correlation. Note this is not true for the distribution of the slope from linear regression, which is normally distributed.
 (iv)
Confidence limits far from the mean. The problem here is one of sampling. To get reliable confidence limits we need a sufficient number of examples that lie outside of these limits. This may be difficult or even impossible from resampling. "Difficult" because the number of resamplings needed to see enough rare events may become prohibitive. "Impossible" because if the sample set is small, even the exhaustive evaluation of all possible samplings may not sample the extrema possible for this data set. E.g. imagine calculating the mean of the small set from the introduction to this section: no resampling can give an average less than 1 or greater than 5, and the extreme averages of exactly 1 or 5 each occur with probability p = 0.2^5 = 0.00032. Therefore, a significance level of 0.0001 can never be established.
"A good way to think of bootstrap intervals is as a cautious improvement over standard intervals, using large amounts of computation to overcome certain deficiencies of the standard methods, for example its lack of transformation invariance. The bootstrap is not intended to be a substitute for precise parametric results but rather a way to reasonably proceed when such results are unavailable." (p. 295)
Advantages
However, there are real advantages. As bootstrapping is nonparametric, it can be very appropriate if the actual distribution is unusual. As an example, let's briefly reconsider the very concept of the confidence interval. Traditional methods give such a range as half of the confidence interval above the mean and half below. As we have seen, the upper and lower ranges don't have to be symmetric. Useful tricks such as the Fisher transform can sometimes get us estimates of ranges about the mean anyway, but perhaps the distribution of the value we are interested in looks nothing like a Gaussian, or a Student's t function, or any other parametric form. Then bootstrapping comes into its own. A good example of this type of distribution can be seen in Shalizi [48] where, in a very readable article, he looks at the day-to-day returns of the stock market.
Conclusions
We have presented here the first part of a description of how confidence intervals for quantities frequently found in computational chemistry may be assessed. Issues such as asymmetric error bars, small sample sizes and combining sources of error have been addressed, along with a survey of analytic results, some old and some new. The importance of such analytic results as a way of thinking about the expected error has been emphasized, although modern methods such as bootstrapping certainly offer alternatives to "back of the envelope" estimates. It would be impossible for such an article to cover all the clever formulae and techniques applicable to modeling even in just classical, parametric statistics; both modeling and statistics are too large a subject matter for that. Nor was it possible to cover much in the way of applicable nonparametric statistics, even though they are an attractive alternative when data is not normally distributed or where more robust measures are required. Not that classical techniques cannot be made robust, but there was little room to describe these techniques either! Most regretfully, more could not be introduced concerning the Bayesian formalism, a path which, once started upon, is hard to turn from, so general and powerful is the approach. However, there are many excellent textbooks that cover these and other aspects of statistics that may be useful to a computer modeler [49, 50, 51, 52].
The follow-on paper to this one will address the comparison of two or more quantities with associated error estimates. This involves the calculation of "mutual" error bars, i.e. confidence intervals on the difference in properties. Here attention must be paid to the covariance aspect of variation, i.e. if the noise we hope to quantify for one property is correlated with the noise in another, we have to exercise care in how this mutual difference is assessed. In our opinion this second paper contains more novel and research-oriented material, simply because of the dearth of material, even in the statistical literature, on some topics, such as the effects of correlation on Pearson's r-value. The effects of correlation can be subtle, and it is hoped the results presented will prevent others from making the same mistakes the author made when first presented with the issues that arise.
In terms of affecting the standards of statistics displayed in the literature, there is only so much that can be done without the coordinated efforts of journals and scientists. It requires standards to both be set and adhered to if computational chemistry is to improve in this way. Fortunately there seems to be a growing consensus that statistics should be taken more seriously in the sciences [53, 54], perhaps as a result of the number of retracted studies, or of papers illustrating how often reports are later contradicted by more thorough studies [55]. There should be no illusions that statistical standards will solve the many problems of our field. Our datasets are usually poorly constructed and limited in extent. Problems are selected because they give the results we want, not to reflect an accurate sampling of real-world application. Poor or negative results are not published [56]. In addition, there is nothing to stop the inappropriate use of statistics, whether inadvertent or with intent to mislead. Many regrettable habits will not be affected.
It is not this author’s intent or purpose to set or even suggest such standards. Rather, this paper and the one that follows are attempts to communicate the potential richness and productive character of statistics as applied to our very empirical field. As an aid to the adoption of the methods presented here it is intended to develop a website, caddstat.eyesopen.com that will provide online forms for statistical calculation relevant to computational chemistry. This site will be described in a subsequent publication.
For any field to progress it must be able to assess its current state, and statistics provides the tools for that assessment. Used honestly and consistently statistical techniques allow a realistic perspective to emerge of our field's problems and successes. As drug discovery becomes harder and more expensive, it is ever more important that the application of computation methods actually deliver their original promise of speeding and improving pharmaceutical design. A more statistically informed field should be a part of that future.
Notes
Acknowledgements
The author would like to thank for their contributions in either reading this manuscript, or suggesting topics, or general encouragement: Georgia McGaughey, Anna Linusson Jonsson, Pat Walters, Kim Branson, Tom Darden, John Chodera, Terry Stouch, Peter Guthrie, Eric Manas, Gareth Jones, Mark McGann, Brian Kelley and Geoff Skillman. Also, for their patience while this work was completed: Bob Tolbert, Terry Stouch and, in particular, Georgia, Emma and Colette.
References
 1. Loredo TJ (1990) From Laplace to Supernova SN 1987A: Bayesian inference in astrophysics. In: Fougere PF (ed) Maximum entropy and Bayesian methods. Kluwer Academic, Dordrecht, pp 81–142
 2. Sivia DS (1996) Data analysis: a Bayesian tutorial. Oxford Science Publications, Oxford
 3. Marin JM, Robert CP (2007) Bayesian core: a practical approach to computational Bayesian statistics. Springer, New York
 4. Carlin BP, Louis TA (2000) Bayes and empirical Bayes methods for data analysis, 2nd edn. Chapman & Hall/CRC, London
 5. Jeffreys H (1939) Theory of probability. Cambridge University Press, Cambridge
 6. Feher M, Williams CI (2012) Numerical errors and chaotic behavior in docking simulations. JCIM 52:724–738
 7. Ziliak ST, McCloskey DN (2007) The cult of statistical significance: how the standard error costs us jobs, justice and lives. University of Michigan Press, Ann Arbor
 8. Gelman A (2013) P values and statistical practice. Epidemiology 24:69–72
 9. Johnson V (2013) Uniformly most powerful Bayesian tests. Ann Stat 41:1716–1741
 10. Taleb NN (2007) The black swan: the impact of the highly improbable. Random House. ISBN 9780679604181
 11. Gladwell M (2002) Blowing up. The New Yorker, April 22, 2002, p 162
 12. Kotz S, Nadarajah S (2000) Extreme value distributions: theory and applications. Imperial College Press, London
 13. Stigler SM (1977) Eight centuries of sampling inspection: the trial of the Pyx. J Am Stat Assoc 72:359
 14. Wainer H (2007) The most dangerous equation: ignorance of how sample size affects statistical variation has created havoc for nearly a millennium. Am Sci 96(1):248–256
 15. Tukey JW (1960) A survey of sampling from contaminated distributions. In: Olkin I, Ghurye S, Hoeffding W, Madow W, Mann W (eds) Contributions to probability and statistics. Stanford University Press, Stanford, pp 448–485
 16. Ripley BD. Robust statistics. http://www.stats.ox.ac.uk/pub/StatMeth/Robust.pdf
 17. Student (aka Gosset WS) (1908) The probable error of a mean. Biometrika 6(1):1–25
 18. Gelman A, Jakulin A, Grazia Pittau M, Su Y (2008) A weakly informative default prior distribution for logistic and other regression models. Ann Appl Stat 2:1360–1383
 19. DeLong ER, DeLong DM, Clarke-Pearson DL (1988) Comparing the area under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44:837–845
 20. Cortes C, Mohri M (2004) Confidence intervals for the area under the ROC curve. Adv Neural Inf Process Syst 17:305–312
 21. Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29–36
 22. Nicholls A (2011) What do we know? Simple statistical techniques that help. Methods Mol Biol 672:531–581
 23. Huang N, Shoichet B, Irwin JJ (2006) Benchmarking sets for molecular docking. JMC 49:6789–6801
 24. Cortes C, Mohri M (2003) AUC optimization vs. error rate minimization. In: Advances in neural information processing systems (NIPS 2003), vol 16. MIT Press, Vancouver, Canada
 25. Welch BL (1946) The generalization of "Student's" problem when several different population variances are involved. Biometrika 34:28–35
 26. Satterthwaite FE (1946) An approximate distribution of estimates of variance components. Biom Bull 2:110–114
 27. Nicholls A (2008) What do we know and when do we know it? JCAMD 22(3–4):239–255
 28. Jain AN, Nicholls A (2008) Recommendations for evaluations of computational methods. JCAMD 22:133–139
 29. Qiu D, Shenkin PS, Hollinger FP, Still WC (1997) A fast analytic method for the calculation of approximate Born radii. J Phys Chem A 101:3005–3014
 30. McKinney JC (2006) Relativistic force-free electrodynamic simulations of neutron star magnetospheres. MNRAS 386:30–34
 31. Guthrie JP (2014) SAMPL4, a blind challenge for computational solvation free energies: the compounds considered. J Comput Aided Mol Des 28(3):151–168
 32. de Levie R (2012) Collinearity in least-squares analysis. J Chem Educ 89:68–78
 33. Pearson K (1904) Mathematical contributions to the theory of evolution, XIII: On the theory of contingency and its relation to association and normal correlation. In: Drapers' company research memoirs (Biometric Series I), University College (reprinted in Early Statistical Papers (1948) by the Cambridge University Press, Cambridge, UK), London, p 37
 34. Fisher RA (1915) Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 10(4):507–521
 35. Bonett DG, Wright TA (2000) Sample size requirements for estimating Pearson, Kendall and Spearman correlations. Psychometrika 65:23–28
 36.
 37. Theil H (1961) Economic forecasts and policy, 2nd edn. North-Holland Publishing Company, Amsterdam
 38. Romero AA (2007) A note on the use of adjusted R2 in model selection. College of William and Mary, working papers, no. 62, October 2007
 39. Schwarz GE (1978) Estimating the dimension of a model. Ann Stat 6:461–464
 40. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19:716–723
 41. Manas ES, Unwalla RJ, Xu ZB, Malamas MS, Miller CP, Harris HA, Hsiao C, Akopian T, Hum WT, Malakian K, Wolfrom S, Bapat A, Bhat RA, Stahl ML, Somers WS, Alvarez JC (2004) Structure-based design of estrogen receptor-beta selective ligands. JACS 126:15106–15119
 42. Geary RC (1936) The distribution of 'Student's' ratio for non-normal samples. Suppl J R Stat Soc 3:178–184
 43. Conan Doyle A (1890) The sign of four, Chap 1. Spencer Blackett, London, p 92
 44. Peirce B (1852) Criterion for the rejection of doubtful observations. Astron J 45:161–163
 45. Romano JP (1988) Bootstrapping the mode. Ann Inst Stat Math 40:565–586
 46. Efron B (1981) Nonparametric standard errors and confidence intervals. Can J Stat 9:139–172
 47. Efron B (1988) Bootstrap confidence intervals: good or bad? Psychol Bull 104:293–296
 48. Shalizi C (2011) The bootstrap. Am Sci 98:186–190
 49. Kanji GK (2006) 100 statistical tests. Sage Publications, London
 50. Wasserman L (2007) All of nonparametric statistics. Springer texts in statistics. ISBN-10: 0387251456
 51. Glantz SA (1980) How to detect, correct, and prevent errors in the medical literature. Circulation 61:1–7
 52. Snedecor GW, Cochran WG (1989) Statistical methods, 8th edn. Blackwell Publishing, Oxford
 53. Editorial (2013) Reducing our irreproducibility. Nature 496:398
 54. Nuzzo R (2014) Scientific method: statistical errors. Nature 506:150–152
 55. Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2:e124
 56. Scargle JD (2000) Publication bias: the "file-drawer" problem in scientific inference. J Sci Explor 14:91–106
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution License, which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.