Abstract
Information on the structure of molecules, retrieved from biochemical databases, plays a pivotal role in disciplines such as metabolomics, systems biology, and drug discovery. However, no such database can be complete, and the chemical structure recorded for a given compound is not necessarily consistent between databases. This paper presents StructRecon, a novel tool for resolving unique and correct molecular structures from database identifiers. StructRecon traverses the cross-links between entries in different databases to construct what we call an identifier graph, which offers a more complete view of the total information available on a particular compound across all the databases. To reconcile discrepancies between databases, we first present an extensible model for chemical structure which supports multiple independent levels of detail, allowing standardisation of the structure to be applied iteratively. In some cases, our standardisation approach still yields multiple structures for a given compound; a random walk-based algorithm is then used to select the most likely structure among the incompatible alternatives. We applied StructRecon to the EColiCore2 model, resolving a unique chemical structure for 85.11% of identifiers. StructRecon is open-source and modular, so support for further databases can be added in the future.
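The two algorithmic ideas in the abstract, the identifier graph and the random walk-based selection, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from StructRecon itself: the identifiers, the `cross_links` and `structure_of` tables, the function names, and the choice of a damped power iteration as the random walk (the abstract only states that the algorithm is random walk-based).

```python
from collections import deque


def build_identifier_graph(start, cross_links):
    """Breadth-first traversal of cross-links from one starting identifier.

    cross_links (hypothetical) maps an identifier to the identifiers that
    its database entry links to. Returns an adjacency dict containing every
    identifier reachable from start.
    """
    graph = {}
    queue = deque([start])
    while queue:
        ident = queue.popleft()
        if ident in graph:
            continue  # already expanded
        graph[ident] = list(cross_links.get(ident, []))
        queue.extend(graph[ident])
    return graph


def random_walk_scores(graph, damping=0.85, iterations=100):
    """Approximate the stationary distribution of a damped random walk."""
    nodes = list(graph)
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # An identifier without outgoing links spreads its mass uniformly.
            targets = graph[v] or nodes
            share = damping * score[v] / len(targets)
            for t in targets:
                nxt[t] += share
        score = nxt
    return score


def select_structure(graph, structure_of):
    """Return the structure whose supporting identifiers score highest."""
    scores = random_walk_scores(graph)
    totals = {}
    for ident, s in scores.items():
        struct = structure_of.get(ident)
        if struct is not None:
            totals[struct] = totals.get(struct, 0.0) + s
    return max(totals, key=totals.get)


# Hypothetical data: four identifiers, two conflicting structures.
cross_links = {
    "db1:x": ["db2:x", "db3:x"],
    "db2:x": ["db3:x", "db4:x"],
    "db3:x": ["db2:x"],
    "db4:x": [],
}
structure_of = {"db2:x": "S_A", "db3:x": "S_A", "db4:x": "S_B"}

graph = build_identifier_graph("db1:x", cross_links)
scores = random_walk_scores(graph)
best = select_structure(graph, structure_of)
print(best)
```

In this toy input, the two identifiers that agree on `S_A` also link to each other, so the walk concentrates its mass on them and `S_A` is selected over `S_B`. The sketch omits the multi-level structure model and the iterative standardisation that the paper applies before any such selection.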
Acknowledgements
This work is supported by the Novo Nordisk Foundation grants NNF19OC0057834 and NNF21OC0066551.
Appendix A
See Fig. 4.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Eriksen, C.A., Andersen, J.L., Fagerberg, R., Merkle, D. (2023). Reconciling Inconsistent Molecular Structures from Biochemical Databases. In: Guo, X., Mangul, S., Patterson, M., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2023. Lecture Notes in Computer Science, vol 14248. Springer, Singapore. https://doi.org/10.1007/978-981-99-7074-2_5
DOI: https://doi.org/10.1007/978-981-99-7074-2_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7073-5
Online ISBN: 978-981-99-7074-2
eBook Packages: Computer Science (R0)