Approach to Improving the Quality of Open Data in the Universe of Small Molecules
We describe an approach to improving the quality and interoperability of open data related to small molecules, such as metabolites, drugs, natural products, food additives, and environmental contaminants. The approach involves a computer implementation of an extended version of the IUPAC International Chemical Identifier (InChI) system that uses the three-dimensional structure of a compound to generate reproducible compound identifiers (standard InChI strings) and universally reproducible designators for all constituent atoms of each compound. These compound and atom identifiers enable reliable federation of information from a wide range of freely accessible databases. In addition, the designators provide a platform for deriving and disseminating information about the physical properties of these molecules. Example applications include compound dereplication, derivation of force fields used in determining three-dimensional structures and investigating molecular interactions, and parameterization of NMR spin system matrices used in compound identification and quantification. We are developing a data definition language (DDL) and a STAR-based data dictionary to support the storage and retrieval of these kinds of information in digital resources. The current database contains entries for more than 90 million unique compounds.
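As a minimal sketch of what federation by compound identifier could look like, the Python example below merges records from two sources that share standard InChI strings as keys. The record layout, the field names, and the `federate` and `is_standard_inchi` helpers are hypothetical and serve only as illustration; they are not the implementation described here, and the extended atom designators are not modeled.

```python
from collections import defaultdict

# Hypothetical records from two databases; field names are illustrative only.
db_a = [
    {"inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "name": "ethanol"},
    {"inchi": "InChI=1S/H2O/h1H2", "name": "water"},
]
db_b = [
    {"inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "synonym": "ethyl alcohol"},
    {"inchi": "InChI=1S/CH4/h1H4", "synonym": "methane"},
]


def is_standard_inchi(identifier):
    """Return True if the string carries the standard InChI prefix."""
    return identifier.startswith("InChI=1S/")


def federate(*sources):
    """Merge records from several sources, keyed by standard InChI string.

    When two sources supply the same field for a compound, the value from
    the earlier source is kept (an assumption made for this sketch).
    """
    merged = defaultdict(dict)
    for source in sources:
        for record in source:
            key = record["inchi"]
            if not is_standard_inchi(key):
                continue  # skip records without a standard InChI key
            for field, value in record.items():
                if field != "inchi":
                    merged[key].setdefault(field, value)
    return dict(merged)


merged = federate(db_a, db_b)
for inchi, fields in merged.items():
    print(inchi, fields)
```

Because both sources describe ethanol with the same standard InChI string, its name and synonym end up in a single merged entry, while water and methane remain separate entries.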
Keywords: Compound and atom identifiers; FAIR principles; Data dictionary; Compound dereplication; Nuclear magnetic resonance spectroscopy; Mass spectrometry; Force field description of small molecules
This work was funded in part by NIH Grants P41GM103399 in support of the National Magnetic Resonance Facility at Madison (NMRFAM), R01GM109046 in support of the Biological Magnetic Resonance Data Bank (BMRB), and P41GM111135 in support of the NMRbox project.