Post-processing of Large Bioactivity Data

  • Jason Bret Harris
Part of the Methods in Molecular Biology book series (MIMB, volume 1939)


Bioactivity data is a valuable scientific data type that needs to be findable, accessible, interoperable, and reusable (FAIR) (Wilkinson et al., Sci Data 3:160018, 2016). However, results from bioassay experiments often exist in formats that hinder interoperability and reuse in follow-up research, especially when combining experimental records from many different sources. This chapter details common issues that arise when processing large bioactivity data sets and methods for handling them in a post-processing scenario. Specifically, it describes observations from a recent effort (Harris, 2017) to post-process massive amounts of bioactivity data from the NIH’s PubChem BioAssay repository (Wang et al., Nucleic Acids Res 42:1075–1082, 2014).

Key words

Bioactivity, Bioassay, ScrubChem, PubChem, Hit-calls, Big data, Data integration


References

  1. Harris J (2017) ScrubChem
  2. Wang Y et al (2014) PubChem BioAssay: 2014 update. Nucleic Acids Res 42:1075–1082
  3. Bento AP et al (2014) The ChEMBL bioactivity database: an update. Nucleic Acids Res 42:1083–1090
  4. Wishart DS et al (2006) DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res 34:D668–D672
  5. Gilson MK et al (2016) BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res 44:D1045–D1053
  6. Toxicology in the 21st Century
  7. Dix DJ et al (2007) The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicol Sci 95:5–12
  8. Davis AP et al (2017) The Comparative Toxicogenomics Database: update 2017. Nucleic Acids Res 45:D972–D978
  9. Nguyen DT et al (2017) Pharos: collating protein information to shed light on the druggable genome. Nucleic Acids Res 45:D995–D1002
  10. Pilarczyk M, Medvedovic M, Fazel Najafabadi M, Naim M, Michal K, Nicholas C, Shana W, Mark B, Wen N, John R, Juozas V, Jarek M, Mario M (2016) iLINCS: Web-Platform For Analysis Of Lincs Data And Signatures
  11. Wilkinson MD et al (2016) The FAIR guiding principles for scientific data management and stewardship. Sci Data 3:160018
  12. Visser U et al (2011) BioAssay ontology (BAO): a semantic description of bioassays and high-throughput screening results. BMC Bioinformatics 12:257
  13. Orchard S et al (2011) Minimum information about a bioactive entity (MIABE). Nat Rev Drug Discov 10:661–669

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Jason Bret Harris (1)
  1. Collaborative Drug Discovery (CDD), Inc., Burlingame, USA
