Predicting the Performance of Fingerprint Similarity Searching

  • Martin Vogt
  • Jürgen Bajorath
Part of the Methods in Molecular Biology book series (MIMB, volume 672)


Fingerprints are bit string representations of molecular structure that typically encode structural fragments, topological features, or pharmacophore patterns. Various fingerprint designs are used in virtual screening, and their search performance depends essentially on three factors: the nature of the fingerprint, the active compounds serving as reference molecules, and the composition of the screening database. Predicting the performance of fingerprint similarity searching is therefore of considerable interest and practical relevance. A quantitative estimate of how likely a fingerprint search is to retrieve active compounds, if any are present in the screening database, would substantially help in selecting the type of fingerprint most suitable for a given search problem. The method presented herein uses concepts from information theory to relate the fingerprint feature distributions of reference compounds to those of screening libraries. If these feature distributions do not differ sufficiently, active database compounds that are similar to the reference molecules cannot be retrieved because they disappear in the “background.” By quantifying the difference in feature distributions using the Kullback–Leibler divergence and relating the divergence to compound recovery rates obtained for different benchmark classes, fingerprint search performance can be quantitatively predicted.
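As a rough illustration of the central quantity, the following Python sketch estimates per-bit feature frequencies for a set of reference compounds and for a screening database, and then computes the Kullback–Leibler divergence between the two distributions. It is not the authors' implementation. It assumes that fingerprint bits are treated as independent binary features, that frequencies are smoothed with a pseudocount to avoid taking the logarithm of zero, and that the divergence is taken from the reference distribution to the database distribution. The function names `bit_frequencies` and `kl_divergence_bits` and the toy data are hypothetical.

```python
import numpy as np


def bit_frequencies(fingerprints, pseudocount=1.0):
    """Estimate the probability that each bit is set.

    `fingerprints` is a 2D array-like of 0/1 values with shape
    (n_compounds, n_bits). The pseudocount (an assumption of this
    sketch) keeps every frequency strictly between 0 and 1.
    """
    fps = np.asarray(fingerprints, dtype=float)
    n_compounds = fps.shape[0]
    return (fps.sum(axis=0) + pseudocount) / (n_compounds + 2.0 * pseudocount)


def kl_divergence_bits(p, q):
    """Kullback-Leibler divergence D(P || Q) in bits.

    P and Q are products of independent Bernoulli distributions with
    per-bit 'on' probabilities `p` and `q` (independence is assumed).
    Under that assumption the divergence is the sum of per-bit terms.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    per_bit = p * np.log2(p / q) + (1.0 - p) * np.log2((1.0 - p) / (1.0 - q))
    return float(per_bit.sum())


if __name__ == "__main__":
    # Toy data only: random bit strings standing in for real fingerprints.
    rng = np.random.default_rng(0)
    n_bits = 64  # arbitrary toy fingerprint length

    # Background bit probabilities of the hypothetical screening database.
    db_probs = rng.uniform(0.05, 0.5, n_bits)
    database = rng.random((5000, n_bits)) < db_probs

    # Hypothetical reference compounds: a few bits are strongly enriched.
    ref_probs = db_probs.copy()
    ref_probs[:8] = 0.9
    reference = rng.random((20, n_bits)) < ref_probs

    p_ref = bit_frequencies(reference)
    q_db = bit_frequencies(database)
    print("KL divergence (reference || database):",
          kl_divergence_bits(p_ref, q_db))
```

A divergence near zero corresponds to the situation described above, in which the reference compounds are indistinguishable from the database "background", whereas larger values indicate feature distributions that set the reference compounds apart. Relating such values to recovery rates would additionally require benchmark searches on known activity classes, which this sketch does not attempt.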

Key words

Bayesian statistics; Compound activity classes; Fingerprints; Information theory; Kullback–Leibler divergence; Prediction of compound recall; Search performance; Virtual screening



Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Martin Vogt (1)
  • Jürgen Bajorath (2)
  1. Department of Life Science Informatics, B-IT, Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany
  2. Department of Life Science Informatics, B-IT, LIMES, Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany
