Facing small and biased data dilemma in drug discovery with enhanced federated learning approaches

  • Research Paper
Science China Life Sciences

Abstract

Artificial intelligence (AI) models usually require large amounts of high-quality training data, in striking contrast to the small and biased datasets available to current drug discovery pipelines. Federated learning has been proposed as a way to use distributed data from different sources without leaking sensitive information. This emerging decentralized machine learning paradigm is expected to dramatically improve the success rate of AI-powered drug discovery. Here, we simulated the federated learning process with property and activity datasets from different sources, in which the values recorded for overlapping molecules show high or low biases. Beyond the benefit of gaining more data, we also demonstrated that federated training has a regularization effect superior to that of centralized training on the pooled datasets with high biases. Moreover, we compared how different client network architectures and coordinator aggregation algorithms affect federated learning performance, and personalized federated learning showed promising results. Our work demonstrates the applicability of federated learning in predicting drug-related properties and highlights its promising role in addressing the small and biased data dilemma in drug discovery.
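
To make the client and coordinator roles mentioned in the abstract concrete, the following is a minimal sketch of one common aggregation scheme, federated averaging (FedAvg; McMahan et al., 2017). It is not the authors' implementation. The toy linear model, the synthetic clients with a per-client measurement bias, and all names (local_update, fed_avg, make_client, simulate) are assumptions made purely for illustration.

```python
import random

def local_update(weights, data, lr=0.05, epochs=5):
    """Client step: a few epochs of SGD on this client's private (x, y) pairs."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y      # prediction error under squared loss
            w -= lr * err * x
            b -= lr * err
    return (w, b)

def fed_avg(client_weights, client_sizes):
    """Coordinator step: average client models, weighted by local dataset size."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)

def make_client(n, bias, rng):
    """Hypothetical client: y = 2x + 1 plus a client-specific recording bias."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + 1.0 + bias + rng.gauss(0.0, 0.05)))
    return data

def simulate(rounds=50, seed=0):
    """Simulate federated training over three clients with differently biased labels."""
    rng = random.Random(seed)
    clients = [make_client(40, bias, rng) for bias in (-0.3, 0.0, 0.3)]
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        # Only model parameters leave each client; the raw data stays local.
        updates = [local_update(global_weights, data) for data in clients]
        global_weights = fed_avg(updates, [len(d) for d in clients])
    return global_weights

if __name__ == "__main__":
    w, b = simulate()
    print(f"global model: w={w:.2f}, b={b:.2f}")
```

In this toy setting the three clients record the same underlying relationship with offsets of -0.3, 0.0 and +0.3, so the averaged model should land close to the unbiased relationship even though no client shares its data. Personalized variants, which the abstract reports as promising, would keep part of each client's model local instead of averaging everything.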

Acknowledgements

This work was supported by the Shanghai Municipal Science and Technology Major Project, the National Natural Science Foundation of China (81773634), the National Science and Technology Major Project of the Ministry of Science and Technology of China (2018ZX09711002), and the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA12050201 and XDA12020368).

Author information

Corresponding authors

Correspondence to Hualiang Jiang, Nan Qiao or Mingyue Zheng.

Ethics declarations

Compliance and ethics: The author(s) declare that they have no conflict of interest.

Cite this article

Xiong, Z., Cheng, Z., Lin, X. et al. Facing small and biased data dilemma in drug discovery with enhanced federated learning approaches. Sci. China Life Sci. 65, 529–539 (2022). https://doi.org/10.1007/s11427-021-1946-0

  • DOI: https://doi.org/10.1007/s11427-021-1946-0
