Predicting the Performance of Fingerprint Similarity Searching
Fingerprints are bit string representations of molecular structure that typically encode structural fragments, topological features, or pharmacophore patterns. Various fingerprint designs are utilized in virtual screening and their search performance essentially depends on three parameters: the nature of the fingerprint, the active compounds serving as reference molecules, and the composition of the screening database. It is of considerable interest and practical relevance to predict the performance of fingerprint similarity searching. A quantitative assessment of the potential that a fingerprint search might successfully retrieve active compounds, if available in the screening database, would substantially help to select the type of fingerprint most suitable for a given search problem. The method presented herein utilizes concepts from information theory to relate the fingerprint feature distributions of reference compounds to screening libraries. If these feature distributions do not sufficiently differ, active database compounds that are similar to reference molecules cannot be retrieved because they disappear in the “background.” By quantifying the difference in feature distribution using the Kullback–Leibler divergence and relating the divergence to compound recovery rates obtained for different benchmark classes, fingerprint search performance can be quantitatively predicted.
Key words: Bayesian statistics, Compound activity classes, Fingerprints, Information theory, Kullback–Leibler divergence, Prediction of compound recall, Search performance, Virtual screening
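The central quantity in the abstract, the Kullback–Leibler divergence between the fingerprint feature distribution of the reference compounds and that of the screening database, can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: each fingerprint bit is treated as an independent Bernoulli variable, bit frequencies are Laplace-smoothed with a hypothetical parameter `alpha`, and natural logarithms are used, so the result is in nats. The function names and the toy fingerprints are illustrative only and are not taken from the chapter.

```python
import math


def bit_frequencies(fingerprints, alpha=1.0):
    """Estimate the smoothed probability that each bit is set.

    `fingerprints` is a non-empty list of equal-length 0/1 sequences.
    `alpha` is a Laplace smoothing term (an assumption of this sketch);
    with alpha > 0 every estimate lies strictly between 0 and 1, which
    keeps the logarithms below well defined.
    """
    n = len(fingerprints)
    n_bits = len(fingerprints[0])
    return [
        (sum(fp[i] for fp in fingerprints) + alpha) / (n + 2 * alpha)
        for i in range(n_bits)
    ]


def kl_divergence_bits(p, q):
    """KL divergence D(P || Q) for independent Bernoulli bit models.

    Under the independence assumption the divergence is the sum of the
    per-bit divergences between Bernoulli(p_i) and Bernoulli(q_i).
    """
    total = 0.0
    for pi, qi in zip(p, q):
        total += pi * math.log(pi / qi)
        total += (1.0 - pi) * math.log((1.0 - pi) / (1.0 - qi))
    return total


# Toy 4-bit fingerprints (made-up data, for illustration only).
reference = [[1, 1, 0, 0], [1, 1, 0, 1], [1, 0, 0, 0]]
database = [[0, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0], [0, 1, 0, 1]]

p = bit_frequencies(reference)  # feature distribution of the reference set
q = bit_frequencies(database)   # feature distribution of the database

divergence = kl_divergence_bits(p, q)
print(f"KL divergence (nats): {divergence:.3f}")
```

A divergence close to zero means that the reference set is nearly indistinguishable from the database in terms of bit frequencies, which corresponds to the case where active compounds disappear in the background. Relating such values to observed recovery rates, as the abstract describes, would require benchmark activity classes that this sketch does not include.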