## Abstract

Analytic formulae are used to estimate the error for two virtual screening metrics, enrichment factor and area under the ROC curve. These analytic error estimates are then compared to bootstrapping error estimates, and shown to have excellent agreement with respect to area under the ROC curve and good agreement with respect to enrichment factor. The major advantage of the analytic formulae is that they are trivial to calculate and depend only on the number of actives and inactives and the measured value of the metric, information commonly reported in papers. In contrast to this, the bootstrapping method requires the individual compound scores. Methods for converting the error, which is calculated as a variance, into more familiar error bars are also discussed.

## References

McGann M (2011) J Chem Inf Model 51(3):578–596

Hanley JA, McNeil BJ (1983) Radiology 148(3):839–843

Hanley JA, McNeil BJ (1982) Radiology 143(1):29–36

Triballeau N, Acher F, Brabet I, Pin JP, Bertrand HO (2005) J Med Chem 48(7):2534–2547

Henderson AR (2005) Clin Chim Acta 359(1–2):1–26

Nicholls A (2008) J Comput Aided Mol Des 22(3):239–255

Jain A, Nicholls A (2008) J Comput Aided Mol Des 22(3):133–139

Jain AN (2007) J Comput Aided Mol Des 21(5):281–306

Nicholls A (2014) J Comput Aided Mol Des 28(9):887–918

OMEGA OpenEye Scientific Software, 3600 Cerrillos Rd., Suite 1107 Santa Fe, NM 87507

FRED OpenEye Scientific Software, 3600 Cerrillos Rd., Suite 1107 Santa Fe, NM 87507

## Additional information

Mark McGann and Istvan Enyedy have contributed equally to this work.

## Appendix: Calculating 95 % confidence assuming a binomial distribution

The details of calculating a CI95 using a binomial distribution bear some explanation. A binomial distribution is a discrete distribution on the range [0, N], with a value at each integer *n* in the range. Each value, *f*(*n*; N, *p*), is the probability of getting exactly *n* successes in N trials. Formally, the definition is

\[ f(n; N, p) = \binom{N}{n} p^{n} \left( 1 - p \right)^{N - n} \]

where *p* is the probability of success, in our case the AUC, and

\[ \binom{N}{n} = \frac{N!}{n!\left( N - n \right)!} \]

What we lack in the equations above is the number of trials, N. We can compute it by recognizing that the variance of a binomial proportion (the fraction of successes, *n*/N) is \(\sigma^{2} = p\left( {1 - p} \right)/N\) and that we have the variance from the Hanley formula (\(\sigma_{AUC}^{2}\)) shown above. Thus we can solve for N as follows

\[ N = \frac{p\left( 1 - p \right)}{\sigma_{AUC}^{2}} \]
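
As a minimal illustration of this step, the Python sketch below rearranges the variance relation to recover N from a measured AUC and its variance. The function name `effective_trials` and the example numbers are hypothetical and not taken from the article; the variance is assumed to have been computed beforehand (for example with the Hanley formula).

```python
def effective_trials(p, variance):
    """Effective number of binomial trials N implied by p and its variance.

    Rearranges sigma^2 = p * (1 - p) / N. `p` is the measured metric
    expressed as a probability (e.g. the AUC) and `variance` is its
    analytic variance. Both inputs are assumed to be supplied by the caller.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if variance <= 0.0:
        raise ValueError("variance must be positive")
    return p * (1.0 - p) / variance


# Hypothetical example: AUC = 0.8 with a variance of 0.0016 (sigma = 0.04)
n_eff = effective_trials(0.8, 0.0016)
```

Note that N obtained this way is generally not an integer, so some rounding or interpolation is needed before the discrete distribution is constructed.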

Now, recalling that *p* is simply the measured AUC, we can construct the binomial distribution. This distribution is discrete rather than continuous, but it becomes approximately continuous when N is large. In practice we have found that creating a continuous distribution by interpolating between the discrete points is effective.

Once the appropriate binomial distribution is constructed, we construct a cumulative distribution curve for the binomial and read the values at 2.5 and 97.5 % to obtain the 95 % confidence interval.
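
The procedure above can be sketched in Python as follows. This is one possible reading under stated assumptions, not the authors' implementation: N is rounded to the nearest integer, the cumulative curve is linearly interpolated between integer points, and the interval is reported as a fraction *n*/N so that it is in the same units as the AUC. All function names are hypothetical.

```python
import math


def binomial_cdf(n_trials, p):
    """Cumulative probabilities F(n) for n = 0..n_trials (log-space terms)."""
    log_p, log_q = math.log(p), math.log1p(-p)
    cdf, total = [], 0.0
    for n in range(n_trials + 1):
        # log of C(N, n) * p^n * (1 - p)^(N - n), via lgamma to avoid overflow
        log_pmf = (math.lgamma(n_trials + 1) - math.lgamma(n + 1)
                   - math.lgamma(n_trials - n + 1)
                   + n * log_p + (n_trials - n) * log_q)
        total += math.exp(log_pmf)
        cdf.append(min(total, 1.0))
    return cdf


def quantile(cdf, level):
    """Fraction n/N where the linearly interpolated CDF reaches `level`."""
    n_trials = len(cdf) - 1
    for n, value in enumerate(cdf):
        if value >= level:
            if n == 0:
                return 0.0
            prev = cdf[n - 1]
            frac = (level - prev) / (value - prev)  # position between n-1 and n
            return (n - 1 + frac) / n_trials
    return 1.0


def binomial_ci95(p, variance):
    """95 % interval for p, read at 2.5 % and 97.5 % of the cumulative curve."""
    # Assumption: round the effective N = p(1 - p) / variance to an integer.
    n_trials = max(1, round(p * (1.0 - p) / variance))
    cdf = binomial_cdf(n_trials, p)
    return quantile(cdf, 0.025), quantile(cdf, 0.975)


# Hypothetical example: AUC = 0.8, variance = 0.0016
lower, upper = binomial_ci95(0.8, 0.0016)
```

Unlike a symmetric interval of plus or minus two standard deviations, the interval obtained this way is asymmetric about *p* and cannot extend outside [0, 1].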

The above calculations are described for AUC, but the same method can be applied to EF by recognizing that \(EF(f_{I}) \cdot f_{I}\) is also a probability and using this value in place of AUC. The resulting 95 % confidence interval is then divided by \(f_{I}\) to convert the result from probability units [0, 1] back to EF units.

## About this article

### Cite this article

McGann, M., Nicholls, A. & Enyedy, I. The statistics of virtual screening and lead optimization.
*J Comput Aided Mol Des* **29**, 923–936 (2015). https://doi.org/10.1007/s10822-015-9861-4

DOI: https://doi.org/10.1007/s10822-015-9861-4