Approach to Improving the Quality of Open Data in the Universe of Small Molecules

  • Conference paper
Business Information Systems Workshops (BIS 2019)


We describe an approach to improving the quality and interoperability of open data related to small molecules, such as metabolites, drugs, natural products, food additives, and environmental contaminants. The approach involves a computer implementation of an extended version of the IUPAC International Chemical Identifier (InChI) system that uses the three-dimensional structure of a compound to generate reproducible compound identifiers (standard InChI strings) and universally reproducible designators for all constituent atoms of each compound. These compound and atom identifiers enable reliable federation of information from a wide range of freely accessible databases. In addition, the designators provide a platform for deriving and disseminating information about the physical properties of these molecules. Examples of applications include compound dereplication, derivation of force fields used in the determination of three-dimensional structures and in investigations of molecular interactions, and parameterization of NMR spin system matrices used in compound identification and quantification. We are developing a data definition language (DDL) and a STAR-based data dictionary to support the storage and retrieval of these kinds of information in digital resources. The current database contains entries for more than 90 million unique compounds.
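To make the federation idea concrete, the following is a minimal sketch, not the authors' implementation, of how records from two sources might be merged once each record carries a standard InChI string. The function name `federate`, the source labels `db_a` and `db_b`, and the record fields are all hypothetical and chosen only for illustration; the sketch also assumes the InChI strings have already been generated elsewhere and does not compute them or the atom designators.

```python
from collections import defaultdict

# Standard InChI strings for two well-known compounds.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"


def federate(*sources):
    """Merge records from several sources, keyed by standard InChI string.

    Each source is a (name, records) pair; each record is a dict with an
    "inchi" entry. This layout is an assumption made for the sketch.
    """
    merged = defaultdict(dict)
    for source_name, records in sources:
        for record in records:
            inchi = record["inchi"]
            # Standard InChI strings start with "InChI=1S/"; skip anything else.
            if not inchi.startswith("InChI=1S/"):
                continue
            for field, value in record.items():
                if field != "inchi":
                    # Prefix each field with its source so provenance is kept.
                    merged[inchi][f"{source_name}.{field}"] = value
    return dict(merged)


# Hypothetical records from two hypothetical databases.
db_a = [
    {"inchi": ETHANOL, "name": "ethanol"},
    {"inchi": METHANOL, "name": "methanol"},
]
db_b = [
    {"inchi": ETHANOL, "formula": "C2H6O"},
    {"inchi": "not-an-inchi", "formula": "?"},
]

federated = federate(("db_a", db_a), ("db_b", db_b))
```

Because both sources describe ethanol with the same identifier string, their fields end up in a single entry, while the malformed record is ignored. Keeping the source name in each field key is one simple way to preserve provenance when sources disagree.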



This work was funded in part by NIH Grants P41GM103399 in support of the National Magnetic Resonance Facility at Madison (NMRFAM), R01GM109046 in support of the Biological Magnetic Resonance Data Bank (BMRB), and P41GM111135 in support of the NMRbox project.

Author information

Corresponding author

Correspondence to John L. Markley.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Markley, J.L., Dashti, H., Wedell, J.R., Westler, W.M., Ulrich, E.L., Eghbalnia, H.R. (2019). Approach to Improving the Quality of Open Data in the Universe of Small Molecules. In: Abramowicz, W., Corchuelo, R. (eds) Business Information Systems Workshops. BIS 2019. Lecture Notes in Business Information Processing, vol 373. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-36690-2

  • Online ISBN: 978-3-030-36691-9

  • eBook Packages: Computer Science, Computer Science (R0)
