Approach to Improving the Quality of Open Data in the Universe of Small Molecules

  • John L. Markley (Email author)
  • Hesam Dashti
  • Jonathan R. Wedell
  • William M. Westler
  • Eldon L. Ulrich
  • Hamid R. Eghbalnia
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 373)


We describe an approach to improving the quality and interoperability of open data related to small molecules, such as metabolites, drugs, natural products, food additives, and environmental contaminants. The approach involves computer implementation of an extended version of the IUPAC International Chemical Identifier (InChI) system that uses the three-dimensional structure of a compound to generate reproducible compound identifiers (standard InChI strings) and universally reproducible designators for all constituent atoms of each compound. These compound and atom identifiers enable reliable federation of information from a wide range of freely accessible databases. In addition, these designators provide a platform for deriving and disseminating information about the physical properties of these molecules. Examples of applications include compound dereplication, derivation of force fields used in determining three-dimensional structures and investigating molecular interactions, and parameterization of NMR spin system matrices used in compound identification and quantification. We are developing a data definition language (DDL) and a STAR-based data dictionary to support the storage and retrieval of these kinds of information in digital resources. The current database contains entries for more than 90 million unique compounds.
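To illustrate how a shared compound identifier supports federation across databases, the following minimal Python sketch merges records from two sources by using the standard InChI string as the join key. It is not the authors' implementation: the source names (`db_a`, `db_b`), the local IDs, and the helper functions `federate` and `shared_compounds` are hypothetical, and the InChI strings are assumed to have been computed beforehand by each source.

```python
from collections import defaultdict

# Hypothetical records from two databases. The local IDs and source names
# are invented for illustration; only the standard InChI strings are real.
DB_A = [
    {"local_id": "A-001", "name": "ethanol",
     "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    {"local_id": "A-002", "name": "methane",
     "inchi": "InChI=1S/CH4/h1H4"},
]
DB_B = [
    {"local_id": "B-17", "name": "ethyl alcohol",
     "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    {"local_id": "B-42", "name": "water",
     "inchi": "InChI=1S/H2O/h1H2"},
]


def federate(**sources):
    """Group records from every source under their standard InChI string.

    Assumption: each source holds at most one record per InChI; a later
    duplicate within the same source would overwrite the earlier one.
    """
    merged = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            merged[record["inchi"]][source_name] = record
    return dict(merged)


def shared_compounds(federated):
    """Keep only the compounds that appear in more than one source."""
    return {inchi: recs for inchi, recs in federated.items() if len(recs) > 1}


federated = federate(db_a=DB_A, db_b=DB_B)
shared = shared_compounds(federated)

for inchi, recs in shared.items():
    names = sorted(rec["name"] for rec in recs.values())
    print(inchi, "->", names)
```

Because the join key is derived from the structure rather than from a name, the two records for ethanol are linked even though the sources label the compound differently. An atom-level join would follow the same pattern with a key made of the compound identifier plus the atom designator.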


Keywords: Compound and atom identifiers; FAIR principles; Data dictionary; Compound dereplication; Nuclear magnetic resonance spectroscopy; Mass spectrometry; Force field description of small molecules



This work was funded in part by NIH Grants P41GM103399 in support of the National Magnetic Resonance Facility at Madison (NMRFAM), R01GM109046 in support of the Biological Magnetic Resonance Data Bank (BMRB), and P41GM111135 in support of the NMRbox project.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • John L. Markley (1), Email author
  • Hesam Dashti (2)
  • Jonathan R. Wedell (1)
  • William M. Westler (1)
  • Eldon L. Ulrich (1)
  • Hamid R. Eghbalnia (1)
  1. University of Wisconsin-Madison, Madison, USA
  2. Harvard Medical School, Boston, USA
