Tales from the war on error: the art and science of curating QSAR data

Journal of Computer-Aided Molecular Design


Curating the data underlying quantitative structure–activity relationship models is a never-ending struggle. Some curation can now be automated but much cannot, especially where data as complex as those pertaining to molecular absorption, distribution, metabolism, excretion, and toxicity are concerned (vide infra). The authors discuss some particularly challenging problem areas in terms of specific examples involving experimental context, incompleteness of data, confusion of units, problematic nomenclature, tautomerism, and misapplication of automated structure recognition tools.


  1. Supplemental material provided by O’Reilly et al. [29] provides an excellent overview of how to interconvert the various types of specifications and half-life measurements.

  2. Variously attributed to Bill Vaughn and Paul Ehrlich.

  3. The particular problematic “compounds” found revealed incidental limitations in the SMILES parser used that have little practical relevance but that have since been addressed.
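As a rough illustration of the interconversion mentioned in note 1, the sketch below relates an in vitro half-life to a first-order rate constant and to an intrinsic clearance normalized to protein. It assumes simple first-order loss of parent compound; the function names, argument names, and unit choices (minutes, µL, mg protein) are illustrative assumptions and are not taken from the cited supplemental material.

```python
import math


def rate_constant_from_half_life(t_half_min: float) -> float:
    """First-order rate constant (1/min) from a half-life in minutes."""
    if t_half_min <= 0:
        raise ValueError("half-life must be positive")
    return math.log(2) / t_half_min


def clint_from_half_life(t_half_min: float,
                         incubation_volume_ul: float,
                         protein_mg: float) -> float:
    """In vitro intrinsic clearance in uL/min/mg protein.

    Assumes first-order depletion: CLint = k * V / (amount of protein).
    """
    if incubation_volume_ul <= 0 or protein_mg <= 0:
        raise ValueError("volume and protein amount must be positive")
    k = rate_constant_from_half_life(t_half_min)
    return k * incubation_volume_ul / protein_mg


def half_life_from_clint(clint_ul_min_mg: float,
                         incubation_volume_ul: float,
                         protein_mg: float) -> float:
    """Inverse of clint_from_half_life: half-life in minutes."""
    if clint_ul_min_mg <= 0:
        raise ValueError("CLint must be positive")
    return math.log(2) * incubation_volume_ul / (clint_ul_min_mg * protein_mg)


if __name__ == "__main__":
    # Hypothetical incubation: 500 uL containing 0.25 mg microsomal protein.
    clint = clint_from_half_life(30.0, 500.0, 0.25)
    print(f"CLint = {clint:.1f} uL/min/mg")
    print(f"t1/2  = {half_life_from_clint(clint, 500.0, 0.25):.1f} min")
```

Recording the incubation volume and protein amount alongside every half-life is what makes the conversion possible at all; a half-life reported without them cannot be compared across assays.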


  1. Williams AJ, Ekins S (2011) A quality alert and call for improved curation of public chemistry databases. Drug Disc Today 16(17–18):747–750. doi:10.1016/j.drudis.2011.07.007

  2. Bologa CG, Oprea TI (2012) Compound collection preparation for virtual screening. In: Larson RS (ed) Bioinformatics and drug discovery. Methods in molecular biology, 2nd edn. Humana Press, New York, pp 125–143

  3. Young D, Martin T, Venkatapathy R, Harten P (2008) Are the chemical structures in your QSAR correct? QSAR Comb Sci 27(11–12):1337–1345

  4. Tropsha A (2010) Best practices for QSAR model development, validation, and exploitation. Mol Inf 29(6–7):476–488

  5. Williams A, Tkachenko V (2014) The Royal Society of Chemistry and the delivery of chemistry data repositories for the community. J Comput Aided Mol Des 28(10):1023–1030. doi:10.1007/s10822-014-9784-5

  6. MedChem Studio. 4.0 edn. Simulations Plus, Inc., Lancaster, CA, USA

  7. ADMET Predictor. 7.2 edn. Simulations Plus, Inc., Lancaster, CA, USA

  8. Fraczkiewicz R, Lobell M, Göller AH, Krenz U, Schoenneis R, Clark RD, Hillisch A (2015) Best of both worlds: combining pharma data and state of the art modeling technology to improve in silico pKa prediction. J Chem Inf Model 55(2):389–397

  9. World Drug Index (2008) Thomson Reuters, New York

  10. Clark R, Liang W, Lee A, Lawless M, Fraczkiewicz R, Waldman M (2014) Using beta binomials to estimate classification uncertainty for ensemble models. J Cheminf 6(1):34

  11. Ran Y, Jain N, Yalkowsky SH (2001) Prediction of aqueous solubility of organic compounds by the general solubility equation (GSE). J Chem Inf Comput Sci 41(5):1208–1217. doi:10.1021/ci010287z

  12. Tetko IV, Sushko Y, Novotarskyi S, Patiny L, Kondratov I, Petrenko AE, Charochkina L, Asiri AM (2014) How accurately can we predict the melting points of drug-like compounds? J Chem Inf Model 54(12):3320–3329. doi:10.1021/ci5005288

  13. Lide DR (ed) (2006) CRC handbook of chemistry and physics, 86th edn. Taylor & Francis, Boca Raton

  14. Windholz M (ed) (1983) Merck index: encyclopedia of chemicals, drugs and biologicals, 10th edn. Merck & Co Inc, Rahway

  15. Avdeef A, Barrett DA, Shaw PN, Knaggs RD, Davis SS (1996) Octanol-, chloroform-, and propylene glycol dipelargonat-water partitioning of morphine-6-glucuronide and other related opiates. J Med Chem 39(22):4377–4381

  16. Clarke S, Jeffrey P (2001) Utility of metabolic stability screening: comparison of in vitro and in vivo clearance. Xenobiotica 31(8–9):591–598

  17. Pryde DC, Dalvie D, Hu Q, Jones P, Obach RS, Tran T-D (2010) Aldehyde oxidase: an enzyme of emerging importance in drug discovery. J Med Chem 53(24):8441–8460

  18. Miners JO, Knights KM, Houston JB, Mackenzie PI (2006) In vitro–in vivo correlation for drugs and other compounds eliminated by glucuronidation in humans: pitfalls and promises. Biochem Pharmacol 71(11):1531–1539

  19. Kaivosaari S, Finel M, Koskinen M (2011) N-glucuronidation of drugs and other xenobiotics by human and animal UDP-glucuronosyltransferases. Xenobiotica 41(8):652–669

  20. Bu H-Z (2006) A literature review of enzyme kinetic parameters for CYP3A4-mediated metabolic reactions of 113 drugs in human liver microsomes: structure–kinetics relationship assessment. Curr Drug Metab 7(3):231–249

  21. Lee CA, Kadwell SH, Kost TA, Serabjitsingh CJ (1995) CYP3A4 expressed by insect cells infected with a recombinant baculovirus containing both CYP3A4 and human NADPH-cytochrome P450 reductase is catalytically similar to human liver microsomal CYP3A4. Arch Biochem Biophys 319(1):157–167

  22. Venkatakrishnan K, von Moltke LL, Greenblatt DJ (1999) Nortriptyline E-10-hydroxylation in vitro is mediated by human CYP2D6 (high affinity) and CYP3A4 (low affinity): implications for interactions with enzyme-inducing drugs. J Clin Pharmacol 39(6):567–577

  23. Yoshii K, Kobayashi K, Tsumuji M, Tani M, Shimada N, Chiba K (2000) Identification of human cytochrome P450 isoforms involved in the 7-hydroxylation of chlorpromazine by human liver microsomes. Life Sci 67(2):175–184

  24. Wójcikowski J, Boksa J, Daniel WA (2010) Main contribution of the cytochrome P450 isoenzyme 1A2 (CYP1A2) to N-demethylation and 5-sulfoxidation of the phenothiazine neuroleptic chlorpromazine in human liver—a comparison with other phenothiazines. Biochem Pharmacol 80(8):1252–1259

  25. Morel E, Lloyd K, Dahl S (1987) Anti-apomorphine effects of phenothiazine drug metabolites. Psychopharmacol 92(1):68–72

  26. Mautz DS, Nelson WL, Shen DD (1995) Regioselective and stereoselective oxidation of metoprolol and bufuralol catalyzed by microsomes containing cDNA-expressed human P4502D6. Drug Metab Dispos 23(4):513–517

  27. Hayhurst G, Harlow J, Chowdry J, Gross E, Hilton E, Lennard M, Tucker G, Ellis S (2001) Influence of phenylalanine-481 substitutions on the catalytic activity of cytochrome P450 2D6. Biochem J 355:373–379

  28. Matsunaga M, Yamazaki H, Kiyotani K, Iwano S, Saruwatari J, Nakagawa K, Soyama A, Ozawa S, Sawada J-I, Kashiyama E (2009) Two novel CYP2D6*10 haplotypes as possible causes of a poor metabolic phenotype in Japanese. Drug Metab Dispos 37(4):699–701

  29. O’Reilly MC, Scott SA, Brown KA, Oguin TH, Thomas PG, Daniels JS, Morrison R, Brown HA, Lindsley CW (2013) Development of dual PLD1/2 and PLD2 selective inhibitors from a common 1,3,8-triazaspiro[4.5]decane core: discovery of ML298 and ML299 that decrease invasive migration in U87-MG glioblastoma cells. J Med Chem 56(6):2695–2699. doi:10.1021/jm301782e

  30. Kiyoi T, Adam JM, Clark JK, Davies K, Easson A-M, Edwards D, Feilden H, Fields R, Francis S, Jeremiah F, McArthur D, Morrison AJ, Prosser A, Ratcliffe PD, Schulz J, Wishart G, Baker J, Campbell R, Cottney JE, Deehan M, Epemolu O, Evans L (2011) Discovery of potent and orally bioavailable heterocycle-based cannabinoid CB1 receptor agonists. Bioorg Med Chem Lett 21(6):1748–1753. doi:10.1016/j.bmcl.2011.01.082

  31. Balakin KV, Ekins S, Bugrim A, Ivanenkov YA, Korolev D, Nikolsky YV, Ivashchenko AA, Savchuk NP, Nikolskaya T (2004) Quantitative structure–metabolism relationship modeling of metabolic N-dealkylation reaction rates. Drug Metab Dispos 32(10):1111–1120

  32. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28(1):31–36. doi:10.1021/ci00057a005

  33. Weininger D, Weininger A, Weininger JL (1989) SMILES. 2. Algorithm for generation of unique SMILES notation. J Chem Inf Comput Sci 29(2):97–101. doi:10.1021/ci00062a008

  34. CAS REGISTRY—the gold standard for chemical substance information (2015) Chemical abstracts service.

  35. Farid NA, Kurihara A, Wrighton SA (2010) Metabolism and disposition of the thienopyridine antiplatelet drugs ticlopidine, clopidogrel, and prasugrel in humans. J Clin Pharmacol 50(2):126–142

  36. Bartolini B, Corniello C, Sella A, Somma F, Politi V (2003) The enol tautomer of indole-3-pyruvic acid as a biological switch in stress responses. In: Allegri G, Costa CL, Ragazzi E, Steinhart H, Varesio L (eds) Developments in tryptophan and serotonin metabolism, vol 527. Advances in experimental medicine and biology. Springer, pp 601–608. doi:10.1007/978-1-4615-0135-0_69

  37. He M, Korzekwa KR, Jones JP, Rettie AE, Trager WF (1999) Structural forms of phenprocoumon and warfarin that are metabolized at the active site of CYP2C9. Arch Biochem Biophys 372(1):16–28. doi:10.1006/abbi.1999.1468

  38. Fernandes P, Florence AJ, Shankland K, Shankland N, Johnston A (2006) Powder study of chlorothiazide N,N-dimethylformamide solvate. Acta Crystallogr E 62(6):o2216–o2218. doi:10.1107/S1600536806015674

  39. Angyal S, Warburton W (1951) Sulphonamides. II. Structure and tautomerism of sulphapyridine, sulphathiazole, and sulphanilylbenzamidine. Aust J Chem 4(1):93–106

  40. Bolton EE, Wang Y, Thiessen PA, Bryant SH (2008) Chapter 12—PubChem: integrated platform of small molecules and biological activities. In: Ralph AW, David CS (eds) Annual reports in computational chemistry, vol 4. Elsevier, pp 217–241. doi:10.1016/S1574-1400(08)00012-1

  41. Durant G, Emmett J, Ganellin C, Miles P, Parsons M, Prain H, White G (1977) Cyanoguanidine-thiourea equivalence in the development of the histamine H2-receptor antagonist, cimetidine. J Med Chem 20(7):901–906

  42. Sundriyal S, Khanna S, Saha R, Bharatam PV (2008) Metformin and glitazones: Does similarity in biomolecular mechanism originate from tautomerism in these drugs? J Phys Org Chem 21(1):30–33

  43. Lipinski CA, Litterman NK, Southan C, Williams AJ, Clark AM, Ekins S (2015) Parallel worlds of public and commercial bioactive chemistry data. J Med Chem 58(5):2068–2076. doi:10.1021/jm5011308

  44. PubChem Substance Database (2015) National Center for Biotechnology Information. Accessed 15 Mar 2015

  45. Hamilton JH, Hofmann S, Oganessian YT (2013) Search for superheavy nuclei. Ann Rev Nucl Part Sci 63(1):383–405. doi:10.1146/annurev-nucl-102912-144535

  46. Asimov I (1957) The marvellous properties of thiotimoline. In: Only a trillion, 1st edn. Abelard-Schuman, London, pp 178–199

  47. Wikipedia (2015) Thiotimoline


The authors wish to thank Jinhua Zhang, Michael S. Lawless, Jayeeta Ghosh, and Michael Bolger for their help in ferreting out errors over the years. We also thank the Simulations Technology colleagues at Simulations Plus for their ongoing real-world testing of the models that were the ultimate product of our efforts: nothing is so effective an inducement to careful curation as knowing that the person across the hall depends on your getting it right. Thanks are also due to Ian Haworth (University of Southern California) and Terry Stouch (Science for Solutions, LLC) for the insight, inspiration, encouragement, and useful information they have provided us.

Corresponding author

Correspondence to Robert D. Clark.




Some things to worry about


Units: mg/mL (ppt) versus mg/L (ppm) versus M


Miscibility and solubility are somewhat different things

Mixed solvents

Salts & other mixtures

Ionic strength

Use of buffers without checking the final pH

Precipitation of insoluble salts (e.g., phosphates)

Melting point
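The unit confusion flagged in this group (mg/mL versus mg/L versus M) can be guarded against by converting every solubility to one canonical scale at ingestion and refusing any unit that is not explicitly recognized. The following is a minimal sketch under that assumption; the unit tables and the function name `to_log_molar` are hypothetical and deliberately incomplete.

```python
import math
from typing import Optional

# Factors that convert a mass concentration to g/L (assumed unit spellings).
_TO_GRAMS_PER_LITRE = {"mg/mL": 1.0, "g/L": 1.0, "mg/L": 1e-3, "ug/mL": 1e-3}
# Factors that convert a molar concentration to mol/L.
_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6}


def to_log_molar(value: float, unit: str,
                 mol_weight: Optional[float] = None) -> float:
    """Return log10 of the concentration in mol/L.

    Mass-based units need the molecular weight (g/mol) of the species
    actually measured; an unrecognized unit is an error, never a guess.
    """
    if value <= 0:
        raise ValueError("concentration must be positive")
    if unit in _TO_MOLAR:
        molar = value * _TO_MOLAR[unit]
    elif unit in _TO_GRAMS_PER_LITRE:
        if mol_weight is None or mol_weight <= 0:
            raise ValueError("a positive molecular weight is required")
        molar = value * _TO_GRAMS_PER_LITRE[unit] / mol_weight
    else:
        raise ValueError(f"unrecognized unit: {unit!r}")
    return math.log10(molar)


if __name__ == "__main__":
    # The same number read as mg/mL or as mg/L differs by three log units.
    print(to_log_molar(1.0, "mg/mL", mol_weight=200.0))
    print(to_log_molar(1.0, "mg/L", mol_weight=200.0))
```

Raising on unknown units is a design choice: a silent default is exactly how a thousand-fold error enters a training set. The molecular weight passed in should be that of the form measured, which matters for the salt and mixture entries in the same group.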


Salts versus free acids or bases


Esters versus salts

Spelling variants

Primes & double primes in names



Names match but structures do not

Strange amidine and guanidine valence isomers

Inverted tetrahedral & planar trisubstituted sp3 carbons


Electron-deficient amidines & aminopyridines

Hydroxypyridines usually exist as pyridones

Triketones usually enolize

Amides and esters only rarely enolize




Ionic strength

Tautomeric representation

Multiple closely spaced pKa’s

Protonated nitro groups

Doubly protonated piperazines at physiological pH

Identification of pKa’s with specific groups may be problematic in some cases
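One way to sanity-check an entry such as a doubly protonated piperazine at physiological pH is to compute the expected species fractions from the two macroscopic pKa values. The sketch below does this for a generic dibasic compound; the pKa values in the example (5.35 and 9.73) are illustrative assumptions roughly in the range often quoted for unsubstituted piperazine, and substituted piperazines can differ considerably.

```python
def dibasic_fractions(ph: float, pka_low: float, pka_high: float) -> dict:
    """Equilibrium fractions of neutral, mono- and di-protonated forms.

    pka_low  : macroscopic pKa for BH2(2+) -> BH(+) + H(+)
    pka_high : macroscopic pKa for BH(+)   -> B     + H(+)
    """
    if pka_low > pka_high:
        raise ValueError("pka_low must not exceed pka_high")
    h = 10.0 ** -ph
    k1 = 10.0 ** -pka_low
    k2 = 10.0 ** -pka_high
    denom = h * h + k1 * h + k1 * k2
    return {
        "neutral": k1 * k2 / denom,
        "mono": k1 * h / denom,
        "di": h * h / denom,
    }


if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from the article).
    for name, frac in dibasic_fractions(7.4, 5.35, 9.73).items():
        print(f"{name:8s} {frac:.3f}")
```

With these assumed values the doubly protonated form is a minor species at pH 7.4, so a record that represents it as the dominant form deserves a second look.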

CYP assays

Complex kinetics

Substrate inhibition

Nonstandard recombinant assay systems

Mutant isoforms

Interference from other oxidases or hydrolases

Aromatic epoxidation versus hydroxylation

Disappearance of parent versus appearance of product

CLint determinations at single substrate concentrations

UGT assays

Reactive or unstable products

Endoplasmic reticulum accessibility artifacts

In vivo metabolites


Unstable or reactive metabolites

Secondary metabolites


Satellite peaks in distribution near three log units away from the average

Negative log units: M versus mM versus μM

Typographical errors: “m” for “μ”

Discrepancies between units in tables and in the text
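The satellite peaks mentioned in this group suggest a simple screen: within a set of values that ought to be comparable (replicates, or a congeneric series), flag any log-scale value that sits close to three log units from the median, since that is the signature of an M/mM/µM mix-up or an "m" typed for "µ". This is a crude heuristic sketch, not a statistical test; the function name, the default tolerance, and the example data are assumptions.

```python
import statistics


def flag_possible_unit_errors(log_values, shift=3.0, tolerance=0.5):
    """Return (index, value, offset) for values about `shift` logs from the median.

    A value is flagged when the absolute offset from the median lies within
    `tolerance` of `shift`. Flags are prompts for manual review only.
    """
    if not log_values:
        return []
    median = statistics.median(log_values)
    flagged = []
    for index, value in enumerate(log_values):
        offset = value - median
        if abs(abs(offset) - shift) <= tolerance:
            flagged.append((index, value, offset))
    return flagged


if __name__ == "__main__":
    # Hypothetical log10(molar) values; the fifth entry looks like mM read as uM.
    values = [-5.1, -5.3, -4.9, -5.0, -2.1, -5.2]
    for index, value, offset in flag_possible_unit_errors(values):
        print(f"row {index}: {value} is {offset:+.2f} log units from the median")
```

A flagged value should be traced back to its source rather than corrected automatically, because genuine thousand-fold differences do occur.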

Enzyme & binding assays

IC50 versus Ki

Comparability of assay conditions

Substrate or displaced ligand identity

Substrate or displaced ligand concentration

Source of the enzyme or receptor

Limit values like “>10 μM” becoming “10.00 μM”
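The last item in this group, a limit such as ">10 µM" turning into "10.00 µM", usually happens when a qualifier is dropped during numeric parsing. A minimal sketch of a parser that keeps the qualifier as its own field follows; the `Measurement` type, the regular expression, and the accepted qualifiers are illustrative assumptions rather than any particular database's schema.

```python
import re
from typing import NamedTuple


class Measurement(NamedTuple):
    qualifier: str  # "=", ">", "<", ">=" or "<="
    value: float
    unit: str

    @property
    def is_censored(self) -> bool:
        """True when the value is a bound rather than a measured number."""
        return self.qualifier != "="


_PATTERN = re.compile(
    r"^\s*(>=|<=|>|<|=)?\s*"                  # optional qualifier
    r"([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)"  # number
    r"\s*(\S+)?\s*$"                          # optional unit token
)


def parse_measurement(text: str) -> Measurement:
    """Parse strings such as '>10 uM' without discarding the qualifier."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"cannot parse measurement: {text!r}")
    qualifier = match.group(1) or "="
    return Measurement(qualifier, float(match.group(2)), match.group(3) or "")


if __name__ == "__main__":
    for raw in (">10 uM", "10.00 uM", "<= 0.5 nM"):
        parsed = parse_measurement(raw)
        print(raw, "->", parsed, "censored:", parsed.is_censored)
```

Keeping the qualifier lets a modeler decide explicitly whether censored values are excluded, treated as bounds, or handled by a censored-regression method, instead of having them masquerade as exact measurements.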


The fact that internet sources agree does not make something true

If something looks too high or too low to be true: check it out

Cite this article

Waldman, M., Fraczkiewicz, R. & Clark, R.D. Tales from the war on error: the art and science of curating QSAR data. J Comput Aided Mol Des 29, 897–910 (2015).
