Conformal prediction of biological activity of chemical compounds

  • Paolo Toccaceli
  • Ilia Nouretdinov
  • Alexander Gammerman
Open Access


The paper presents an application of Conformal Predictors to a chemoinformatics problem: predicting the biological activities of chemical compounds. It addresses several challenges specific to this domain: a large number of compounds (training examples), a high-dimensional feature space, sparseness and a strong class imbalance. A variant of conformal predictors called the Inductive Mondrian Conformal Predictor is applied to deal with these challenges. Results are presented for several non-conformity measures extracted from underlying algorithms and for different kernels. A number of performance measures are used to demonstrate the flexibility of Inductive Mondrian Conformal Predictors in dealing with such a complex set of data. This approach allowed us to identify the most likely active compounds for a given biological target and to present them in ranked order.
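
As an illustration of the label-conditional (Mondrian) idea mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes that non-conformity scores have already been produced by some underlying algorithm. The function names `mondrian_p_values` and `prediction_set` and all numeric scores are hypothetical. The p-value for a hypothesised label is computed only against calibration examples of that same label, which is what keeps the validity guarantee per class under strong class imbalance.

```python
from collections import defaultdict


def mondrian_p_values(cal_scores, cal_labels, test_scores):
    """Label-conditional (Mondrian) inductive conformal p-values.

    cal_scores[i]  : non-conformity score of calibration example i (true label)
    cal_labels[i]  : true label of calibration example i
    test_scores[y] : non-conformity score of the test object under label y
    """
    # Group calibration scores by class, so each label has its own reference set.
    by_label = defaultdict(list)
    for score, label in zip(cal_scores, cal_labels):
        by_label[label].append(score)

    p_values = {}
    for label, alpha in test_scores.items():
        same_class = by_label.get(label, [])
        # Count same-class calibration examples at least as "strange" as the test object.
        at_least_as_strange = sum(1 for a in same_class if a >= alpha)
        p_values[label] = (at_least_as_strange + 1) / (len(same_class) + 1)
    return p_values


def prediction_set(p_values, significance):
    """Labels whose p-value exceeds the chosen significance level."""
    return {label for label, p in p_values.items() if p > significance}


# Hypothetical, imbalanced calibration set: 9 inactive (0) and 2 active (1) compounds.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.2, 0.6]
cal_labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Hypothetical scores of one test compound under each candidate label.
test_scores = {0: 0.85, 1: 0.1}

p = mondrian_p_values(cal_scores, cal_labels, test_scores)
region = prediction_set(p, significance=0.25)
```

Under these assumptions, test compounds could then be ordered by their p-value for the active class to obtain a ranking of the most likely active compounds.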


Conformal prediction, Confidence estimation, Chemoinformatics, Non-conformity measure

Mathematics Subject Classification (2010)




This project (ExCAPE) has received funding from the European Union’s Horizon 2020 Research and Innovation programme under Grant Agreement no. 671555. For help in conducting experiments, we are grateful to the Ministry of Education, Youth and Sports (Czech Republic), which supports the Large Infrastructures for Research, Experimental Development and Innovations project “IT4Innovations National Supercomputing Center – LM2015070”. This work was also supported by EPSRC grant EP/K033344/1 (“Mining the Network Behaviour of Bots”) and by the Technology Integrated Health Management (TIHM) project awarded to the School of Mathematics and Information Security at Royal Holloway as part of an initiative by NHS England supported by InnovateUK. We are indebted to Lars Carlsson of Astra Zeneca for providing the data and for useful discussions. We are also thankful to Zhiyuan Luo and Vladimir Vovk for many valuable comments and discussions.



Copyright information

© The Author(s) 2017

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors and Affiliations

  1. Royal Holloway, University of London, Egham, UK
