Reconciling Inconsistent Molecular Structures from Biochemical Databases

  • Conference paper
Bioinformatics Research and Applications (ISBRA 2023)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 14248)

Abstract

Information on the structure of molecules, retrieved from biochemical databases, plays a pivotal role in disciplines such as metabolomics, systems biology, and drug discovery. However, no such database can be complete, and the chemical structure recorded for a given compound is not necessarily consistent between databases. This paper presents StructRecon, a novel tool for resolving unique and correct molecular structures from database identifiers. StructRecon traverses the cross-links between entries in different databases to construct what we call an identifier graph, which offers a more complete view of the total information available on a particular compound across all the databases. To reconcile discrepancies between databases, we first present an extensible model for chemical structure which supports multiple independent levels of detail, allowing standardisation of the structure to be applied iteratively. In some cases, our standardisation approach yields multiple structures for a given compound; a random walk-based algorithm is then used to select the most likely structure among the incompatible alternatives. We applied StructRecon to the EColiCore2 model, resolving a unique chemical structure for 85.11% of identifiers. StructRecon is open-source and modular, which makes it possible to add support for more databases in the future.
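To make the idea of random walk-based selection on an identifier graph concrete, the following is a minimal sketch and not StructRecon's actual implementation. It assumes a generic PageRank-style power iteration; the function name `random_walk_scores`, the damping value, and all node labels (`dbA:1`, `smiles:O`, and so on) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def random_walk_scores(edges, damping=0.85, iterations=100):
    """Score the nodes of a directed graph with a PageRank-style random walk.

    This is a generic power iteration, used here as a stand-in for whatever
    random walk-based scoring a real tool applies.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    score = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Probability mass on nodes without out-links is spread uniformly.
        dangling = sum(score[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        nxt = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                nxt[dst] += damping * score[src] / len(out_links[src])
        score = nxt
    return score


# Toy identifier graph (hypothetical): one query identifier cross-links to
# three database entries, two of which agree on the structure "O".
edges = [
    ("query:h2o", "dbA:1"),
    ("query:h2o", "dbB:2"),
    ("query:h2o", "dbC:3"),
    ("dbA:1", "smiles:O"),
    ("dbB:2", "smiles:O"),
    ("dbC:3", "smiles:[OH-]"),
]

scores = random_walk_scores(edges)
structures = [node for node in scores if node.startswith("smiles:")]
best = max(structures, key=scores.get)
print(best, {s: round(scores[s], 4) for s in structures})
```

In this toy graph the structure supported by two database entries accumulates more of the walk's probability mass than the one supported by a single entry, so it is the one selected. How a real tool weights edges or turns scores into a confidence ratio is not specified here.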

Acknowledgements

This work is supported by the Novo Nordisk Foundation grants NNF19OC0057834 and NNF21OC0066551.

Author information

Corresponding author

Correspondence to Casper Asbjørn Eriksen.

Appendix A

See Fig. 4.

Fig. 4. The identifier graph generated by the BiGG ID M_h2o_c. For a description of how to read this graph, see the caption of Fig. 2. This figure clearly demonstrates the impact of the scoring algorithm, as it chooses the conventional structure, \(\mathrm{H_2O}\), at a confidence ratio of 0.07.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Eriksen, C.A., Andersen, J.L., Fagerberg, R., Merkle, D. (2023). Reconciling Inconsistent Molecular Structures from Biochemical Databases. In: Guo, X., Mangul, S., Patterson, M., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2023. Lecture Notes in Computer Science, vol 14248. Springer, Singapore. https://doi.org/10.1007/978-981-99-7074-2_5

  • DOI: https://doi.org/10.1007/978-981-99-7074-2_5

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-7073-5

  • Online ISBN: 978-981-99-7074-2

  • eBook Packages: Computer Science, Computer Science (R0)
