Abstract
We describe an approach to improving the quality and interoperability of open data related to small molecules, such as metabolites, drugs, natural products, food additives, and environmental contaminants. The approach involves a computer implementation of an extended version of the IUPAC International Chemical Identifier (InChI) system that uses the three-dimensional structure of a compound to generate reproducible compound identifiers (standard InChI strings) and universally reproducible designators for all constituent atoms of each compound. These compound and atom identifiers enable reliable federation of information from a wide range of freely accessible databases. In addition, these designators provide a platform for deriving and disseminating information about the physical properties of these molecules. Example applications include compound dereplication, derivation of force fields used in determining three-dimensional structures and investigating molecular interactions, and parameterization of NMR spin system matrices used in compound identification and quantification. We are developing a data definition language (DDL) and a STAR-based data dictionary to support the storage and retrieval of these kinds of information in digital resources. The current database contains entries for more than 90 million unique compounds.
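As a rough illustration of how a shared compound identifier can support federation, the following Python sketch merges records from two sources keyed on the standard InChI string. It is not the authors' implementation: the functions `split_inchi_layers` and `federate`, the source names `db_a` and `db_b`, and the `local_id` fields are all hypothetical, and the layer parsing is a simplification that assumes a standard InChI with a formula layer.

```python
from collections import defaultdict

# Standard InChI strings for ethanol and methanol.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"


def split_inchi_layers(inchi: str) -> dict:
    """Split a standard InChI into version, formula, and prefixed layers.

    Simplified sketch: assumes the second field is the formula layer and
    that every later layer starts with a one-letter prefix (c, h, q, ...).
    """
    if not inchi.startswith("InChI=1S/"):
        raise ValueError("expected a standard InChI string (InChI=1S/...)")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for layer in parts[2:]:
        layers[layer[0]] = layer[1:]
    return layers


def federate(*sources):
    """Merge records from several named sources, keyed by InChI string.

    Each source is a (name, records) pair, where records maps an InChI
    string to a dict of fields held by that source.
    """
    merged = defaultdict(dict)
    for name, records in sources:
        for inchi, fields in records.items():
            merged[inchi][name] = fields
    return dict(merged)


# Hypothetical records from two independent sources (illustrative only).
db_a = {ETHANOL: {"local_id": "A-001"}}
db_b = {ETHANOL: {"local_id": "B-417"}, METHANOL: {"local_id": "B-002"}}

merged = federate(("db_a", db_a), ("db_b", db_b))
layers = split_inchi_layers(ETHANOL)
```

Here `merged[ETHANOL]` holds the entries from both sources because they share the same identifier, while `merged[METHANOL]` holds only the `db_b` entry. The numbers in the connectivity layer (`layers["c"]`) are InChI's canonical numbers for the non-hydrogen atoms, which is the kind of reproducible numbering that atom-level designators can build on.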
Acknowledgments
This work was funded in part by NIH Grants P41GM103399 in support of the National Magnetic Resonance Facility at Madison (NMRFAM), R01GM109046 in support of the Biological Magnetic Resonance Data Bank (BMRB), and P41GM111135 in support of the NMRbox project.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Markley, J.L., Dashti, H., Wedell, J.R., Westler, W.M., Ulrich, E.L., Eghbalnia, H.R. (2019). Approach to Improving the Quality of Open Data in the Universe of Small Molecules. In: Abramowicz, W., Corchuelo, R. (eds) Business Information Systems Workshops. BIS 2019. Lecture Notes in Business Information Processing, vol 373. Springer, Cham. https://doi.org/10.1007/978-3-030-36691-9_44
DOI: https://doi.org/10.1007/978-3-030-36691-9_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-36690-2
Online ISBN: 978-3-030-36691-9
eBook Packages: Computer Science, Computer Science (R0)