A comparison of molecular representations for lipophilicity quantitative structure–property relationships with results from the SAMPL6 logP Prediction Challenge

Abstract

Effective representation of a molecule is required to develop useful quantitative structure–property relationships (QSPR) for accurate prediction of chemical properties. The octanol–water partition coefficient logP, a measure of lipophilicity, is an important property for pharmacological and toxicological endpoints used in the pharmaceutical and regulatory spheres. We compare physicochemical descriptors, structural keys, and circular fingerprints in their ability to effectively represent a chemical space and characterise molecular features to correlate with lipophilicity. Exploratory landscape continuity analyses revealed that whole-molecule physicochemical descriptors could map together compounds that were similar in both molecular features and logP, indicating higher potential for use in logP QSPRs compared to the substructural approach of structural keys and circular fingerprints. Indeed, logP QSPR models parameterised by physicochemical descriptors consistently performed with the lowest error. Our best performing model was a stochastic gradient descent-optimised multilinear regression with 1438 descriptors, returning an internal benchmark RMSE of 1.03 log units. This corroborates the well-established notion that lipophilicity is an additive, whole-molecule property. We externally tested the model by participating in the 2019 SAMPL6 logP Prediction Challenge and blindly predicting for 11 protein kinase inhibitor fragment-like molecules. Our model returned an RMSE of 0.49 log units, placing eighth overall and third in the empirical methods category (submission ID ‘hdpuj’). Permutation feature importance analyses revealed that physicochemical descriptors could characterise predictive molecular features highly relevant to the kinase inhibitor fragment-like molecules.
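As a rough illustration of the modelling ideas named in the abstract (a multilinear regression fitted by stochastic gradient descent, scored by RMSE, and inspected with permutation feature importance), the following is a minimal sketch. It is not the authors' pipeline. The three synthetic "descriptors", the coefficients used to generate the data, the learning rate, the number of epochs, and all function names are assumptions made purely for illustration, and only the Python standard library is used.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as y (log units for logP)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def predict(weights, bias, row):
    """Multilinear model: bias plus a weighted sum of descriptor values."""
    return bias + sum(w * x for w, x in zip(weights, row))


def fit_sgd(X, y, lr=0.01, epochs=200, seed=0):
    """Fit the linear model by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    weights, bias = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a new random order each epoch
        for i in order:
            err = predict(weights, bias, X[i]) - y[i]
            # Gradient of 0.5 * err**2 with respect to each parameter
            weights = [w - lr * err * x for w, x in zip(weights, X[i])]
            bias -= lr * err
    return weights, bias


def permutation_importance(weights, bias, X, y, seed=0):
    """Increase in RMSE when a single descriptor column is shuffled."""
    rng = random.Random(seed)
    base = rmse(y, [predict(weights, bias, r) for r in X])
    scores = []
    for j in range(len(X[0])):
        col = [r[j] for r in X]
        rng.shuffle(col)
        X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, col)]
        scores.append(rmse(y, [predict(weights, bias, r) for r in X_perm]) - base)
    return scores


# Synthetic stand-in data (assumption): 200 "molecules", 3 hypothetical descriptors.
# The third descriptor deliberately has no influence on the target.
data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
y = [2.0 * r[0] - 1.0 * r[1] + 0.5 + data_rng.gauss(0, 0.1) for r in X]

weights, bias = fit_sgd(X, y)
train_rmse = rmse(y, [predict(weights, bias, r) for r in X])
importances = permutation_importance(weights, bias, X, y)

print("weights:", weights, "bias:", bias)
print("training RMSE:", train_rmse)
print("permutation importances:", importances)
```

In this toy setting, shuffling a descriptor the model relies on raises the RMSE sharply, while shuffling the irrelevant descriptor leaves it almost unchanged. That is the intuition behind using permutation importance to identify predictive molecular features. Real descriptor sets are much larger and would normally be standardised before gradient-based fitting.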

Acknowledgements

We thank the National Institutes of Health (Grant No. R01-GM124270) for their support in funding the SAMPL6 Challenges and associated experimental work.

Author information

Corresponding author

Correspondence to Slade Matthews.

About this article

Cite this article

Lui, R., Guan, D. & Matthews, S. A comparison of molecular representations for lipophilicity quantitative structure–property relationships with results from the SAMPL6 logP Prediction Challenge. J Comput Aided Mol Des 34, 523–534 (2020). https://doi.org/10.1007/s10822-020-00279-0

  • DOI: https://doi.org/10.1007/s10822-020-00279-0

Keywords

  • QSPR
  • logP
  • Physicochemical properties
  • Machine learning
  • SAMPL6