Glycoscience data content in the NCBI Glycans and PubChem

Kim, Sunghwan; Zhang, Jian; Cheng, Tiejun; Li, Qingliang; Bolton, Evan E.

doi:10.1007/s00216-024-05459-7


Feature Article
Open access
Published: 12 August 2024


Analytical and Bioanalytical Chemistry


Abstract

Studying glycans and their functions in the body aids in understanding disease mechanisms and developing new treatments. This necessitates resources that provide comprehensive glycan data integrated with relevant information from other scientific fields such as genomics, genetics, proteomics, metabolomics, and chemistry. The present paper describes two resources at the U.S. National Center for Biotechnology Information (NCBI), the NCBI Glycans and PubChem, which provide glycan-related information useful for the glycoscience research community. The NCBI Glycans (https://www.ncbi.nlm.nih.gov/glycans/) is a dedicated website for glycobiology data content at NCBI and provides quick access to glycan-related information scattered across multiple NCBI databases as well as other information resources external to NCBI. Importantly, the NCBI Glycans hosts the official web page for the symbol nomenclature for glycans (SNFG), which is the standard graphical representation of glycan structures recommended for scientific publication. PubChem (https://pubchem.ncbi.nlm.nih.gov), on the other hand, is a research-focused, large-scale public chemical database that contains a substantial number of glycan-containing records and is integrated with important glycoscience resources like GlyTouCan, GlyCosmos, and GlyGen. PubChem organizes glycan-related information within multiple data collections (i.e., Substance, Compound, Protein, Gene, Pathway, and Taxonomy) and provides various tools and services that allow users to access them both interactively through a web browser and programmatically through a REST-ful interface, including PUG-View. The NCBI Glycans and PubChem highlight glycan-related data and improve their accessibility, helping scientists exploit these data in their research.


Introduction

Glycans play crucial roles in various biological processes [1,2,3,4,5]. As major components of the cell surface, they are involved in cell-cell communications and signaling events. Recognized by microbes (e.g., viruses and bacteria), glycans influence the host-pathogen interaction in infectious disease at various stages from initial colonization to tissue spread and inflammation [6]. In addition, defects in glycan synthesis, metabolism, and recognition are associated with various diseases [7], such as cancer [8,9,10] and congenital disorders of glycosylation (CDGs) [11, 12]. Therefore, studying glycans and their roles in the body helps in understanding disease mechanisms and developing novel therapeutics. This requires resources that provide comprehensive glycan data to be integrated with relevant data in other scientific areas such as genomics, genetics, proteomics, metabolomics, and chemistry.

Since the development of CarbBank [13] in the late 1980s, many glycan-related information resources and coordinated efforts have emerged, including Glycome Atlas [14, 15], the Carbohydrate Structure Databases [16], the carbohydrate-active enzyme (CAZy) database [17], Glyco Epitope [18], Total Glycome Database [19], KEGG Glycan [20], GlyTouCan [21, 22], GlyComb [23], GlyPOST [24], UniCarb-DR [25], GlyGen [26], Glycomics@ExPASy [27], GlyCosmos Portal [28], GlySpace Alliance [29, 30], and many others. To help users more efficiently use glycan data scattered across many different resources, there have been collaborative efforts to promote data exchange and integration among glycoscience data resources [28,29,30]. An example is the GlyCosmos Portal [28], which integrates glycan data repositories, including GlyTouCan (for glycan structures) [21, 22], GlyComb (for glycoconjugate data) [23], GlyPOST (for glycomics mass-spectrometry (MS) raw data) [24], and UniCarb-DR (for glycomics MS peak lists) [25]. The GlyCosmos Portal also integrates glycan data from multiple databases and provides quick access to multi-omics data for glycans. Another example of collaborative efforts in glycoscience is the GlySpace Alliance [29, 30], a tri-continent alliance formed by the teams responsible for three major glycoscience data integration projects: GlyGen (in the USA) [26], Glycomics@ExPASy (in Switzerland) [27], and the GlyCosmos Portal (in Japan) [28]. The GlySpace Alliance aims to provide high-quality glycan-related data by freely sharing relevant information among its participating members.

The U.S. National Center for Biotechnology Information (NCBI) manages two glycan-related information resources: NCBI Glycans and PubChem [31,32,33,34]. The NCBI Glycans (https://www.ncbi.nlm.nih.gov/glycans/) website, a dedicated glycan information portal at NCBI, provides quick access to glycoscience data content scattered across multiple NCBI databases as well as other databases external to NCBI. PubChem [31,32,33,34] is a large-scale public chemical database and has a substantial amount of glycan-related data, including curated information such as experimental bioactivities of glycans. The present paper provides an overview of these two resources, including their data sources and organization as well as tools and services useful to researchers interested in glycoscience content. It also explains the current limitations of the two resources and the ongoing efforts to address them.

Material and methods

Data sources

PubChem data are organized into multiple data collections [35], including Substance, Compound, BioAssay, Gene, Protein, Pathway, Cell Line, Taxonomy, and Patent. While the Substance and BioAssay collections serve as archives that keep chemical information voluntarily submitted by data sources, the other collections serve as knowledgebases that provide users with easy access to organized high-quality information about PubChem records.

Considering PubChem’s dual role as an archive and as a knowledgebase, its data sources can be broadly classified into two groups: (1) archival data sources, which voluntarily submit their data to PubChem Substance and BioAssay, and (2) annotation sources, from which the PubChem team collects authoritative and curated chemical information to annotate PubChem records. Some sources may belong to both groups.

PubChem has almost one thousand data sources (996 sources as of June 7, 2024). Some of them are popular glycoscience information resources, including GlyTouCan [21, 22], GlyGen [26], GlyCosmos Glycoscience Portal [28], and GlycoNAVI. These resources are used to annotate various types of PubChem records (such as compounds, proteins, genes, pathways, and organisms).

It is noteworthy that GlyTouCan is also an archival data source within PubChem, meaning that its glycan records can be found in PubChem Substance. Currently, there are about 220 thousand substances provided by GlyTouCan and about 40% (84 thousand substances) of them are associated with PubChem Compound records.

Data integration

For the 220 thousand substances from GlyTouCan, their Web3.0 Unique Representation of Carbohydrate Structure (WURCS) strings [36, 37] were used to generate molecular structures in the structure-data file (SDF) format using MolWURCS (https://gitlab.com/glycoinfo/molwurcs), which is designed to interconvert between WURCS and several other file formats used in molecular modeling, computational chemistry, and related areas. The generated structures were processed through chemical structure standardization [38]. If the standardized structure of a glycan already existed in PubChem Compound, an association between the substance and the compound was generated. If the standardized structure did not exist in PubChem Compound, a new compound record was created before the association was generated.
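The association logic described above can be illustrated with a minimal sketch. This is not PubChem's actual implementation: the function name, the dictionary-based "compound collection," and the `standardize` callable are all hypothetical stand-ins, and the rule that a substance without a discrete structure is skipped is an assumption made for illustration.

```python
def associate_substances(substances, compounds, standardize):
    """Link substances to compounds by standardized structure (illustrative only).

    substances:  dict mapping a substance ID to its generated structure
    compounds:   dict mapping a standardized structure to a compound ID;
                 new entries are added in place when no match exists
    standardize: hypothetical callable returning a standardized structure,
                 or None when no discrete structure can be produced
    """
    next_cid = max(compounds.values(), default=0) + 1
    links = {}
    for sid, structure in substances.items():
        key = standardize(structure)
        if key is None:
            # Assumption: substances without a discrete structure get no compound.
            continue
        if key not in compounds:
            # No existing compound: create one before making the association.
            compounds[key] = next_cid
            next_cid += 1
        links[sid] = compounds[key]
    return links
```

In this toy model, two substances whose structures standardize to the same form end up linked to the same compound ID, which mirrors the many-to-one relationship between Substance and Compound records.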

Glycan-related data used to annotate PubChem records were obtained from annotation data sources, including GlyGen, GlyTouCan, GlyCosmos, and GlycoNAVI (see Table 1). These data were mapped to corresponding records in PubChem. The identifiers used for the data mapping were the GlyTouCan IDs for compounds and the NCBI Gene IDs and protein accessions for genes and proteins, respectively. For taxa, the NCBI Taxonomy IDs were used. For pathways, the Reactome IDs were used.

Table 1 Glycan-related annotations in PubChem


Data presentation

As mentioned previously, PubChem has multiple data collections. Each record in these collections has a Summary page, which is a dedicated web page that shows all information for that record available in PubChem. The glycan-related annotations for a PubChem record are presented in the appropriate section of the Summary page for that record.

Results and discussion

NCBI Glycans

The NCBI Glycans (https://www.ncbi.nlm.nih.gov/glycans/), launched in September 2017, serves as a central gateway for glycoscience researchers by providing quick access to a wealth of information scattered across multiple information resources. It provides links to two popular online books, “Essentials of Glycobiology” [39] and “Glycoscience Protocols (GlycoPODv2)” [40], freely available at the NCBI Bookshelf. The “Essentials of Glycobiology” is an online textbook for upper-undergraduate and graduate students majoring in life sciences and biomedicine. It provides a basic overview of glycobiology and covers a wide range of topics, from biology and medicine to chemistry, bioenergy, and material science. The “Glycoscience Protocols” contains free online protocols for experiments commonly performed in glycoscience research. These protocols were provided by the GlycoScience Protocol Online Database (GlycoPOD) (https://jcggdb.jp/GlycoPOD/) of the Japan Consortium for Glycoscience and Glycotechnology (JCGG).

Importantly, the NCBI Glycans hosts the official webpage for the symbol nomenclature for glycans (SNFG) (https://www.ncbi.nlm.nih.gov/glycans/snfg.html) [41, 42], which is the standard graphical representation of glycan structures recommended by the SNFG discussion group (https://www.ncbi.nlm.nih.gov/glycans/snfggroup.html). The SNFG page provides a detailed description of the SNFG standard, along with a list of software tools supporting it. A recent addition to the NCBI Glycans is the webpage for the structural representation of sialic acids (Sia) and other nonulosonic acids (NulOs) using the SNFG (https://www.ncbi.nlm.nih.gov/glycans/sialic.html) [43]. For monosaccharide residues listed on the two web pages, links to corresponding records in PubChem are available, helping users to find additional information about them.

Through the NCBI Glycans, users can readily identify glycan-related records in multiple NCBI resources, including PubMed, PubChem, Structure, Gene, Protein, MedGen, and Bookshelf. The NCBI Glycans also provides links to external resources useful for glycoscience research, including the Carbohydrate Structure Databases [16], the carbohydrate-active enzyme (CAZy) database [17], ExPASy Glycomics Resource Page [27], GLYCAM (https://glycam.org/), GlyCosmos Portal [28], GlyGen [26], GlySpace Alliance [29, 30], and GlyTouCan [21, 22].

Glycans in PubChem Substance and Compound

Launched in 2004, PubChem has been a key chemical information resource for scientific communities as well as the general public. While PubChem primarily contains small molecules, it also has many glycan and glycan-containing chemical structure records submitted by various sources. Using the SDF files of compound records in PubChem (from the PubChem FTP site) as inputs, the Sugar & Splice toolkit from NextMove Software (https://www.nextmovesoftware.com/sugarnsplice.html) detected 1,167,038 compound records containing glycan monomers (Additional File 1). For 142,974 compound records, the PubChem definition of a “biologic” was met, meaning that the structure contains recognized biopolymers, the substituents are known, and Sugar & Splice was able to generate a biologic image or to compute at least one of the following biologic properties (LINUCS [44], IUPAC [45], and IUPAC-condensed [45]) (Additional File 2).

Glycan data submission from GlyTouCan and SNFG to PubChem, which started back in 2015, is an important addition to the existing glycan-related data content. As of June 13, 2024, PubChem contains about 220 thousand glycan substances from these two sources. Most of these substances were submitted by GlyTouCan, with 225 substances from SNFG. Among the ~220 thousand glycan substances, ~40% of them are mapped to 84 thousand compound records in PubChem. The Summary page of each of these compounds presents all information available for that compound in PubChem, including annotations collected from authoritative data sources as well as experimentally determined bioactivity data archived in PubChem. Most of the glycan-specific data available within the Compound Summary are shown in the Biologic Description section, as in the following example (for CID 91846437):

https://pubchem.ncbi.nlm.nih.gov/compound/91846437#section=Biologic-Description

This section (Fig. 1) presents the Scalable Vector Graphics (SVG) image and line notations (LINUCS [44], IUPAC [45], and IUPAC-condensed [45]) of the glycan structure, computed by PubChem using NextMove Software’s Sugar & Splice toolkit. The section also shows various kinds of annotations (e.g., WURCS, classification, monosaccharide composition, motif, permethylated and monoisotopic masses), collected from GlyTouCan, GlyGen, and GlyCosmos (see Table 1).

There are glycan-specific annotations that appear in sections other than the Biologic Description. Examples are the GlyTouCan accession (under the Names and Identifiers section) and the Taxonomy information (in the Taxonomy section), which are accessible via the URLs (when using the same example of CID 91846437):

It is noteworthy that the Taxonomy section provides information on the organism and tissue where the glycan can be found, which is also presented as “GlyCosmos Species” and “GlyCosmos Tissue/Organ” in the Biologic Description section. However, the Taxonomy section presents that information together with the reference from which it was extracted, helping users cross-check the accuracy of the information (see Fig. 2). In addition, all annotations provided on the Summary page of a PubChem record are presented with their provenance information (e.g., which data submitter provided the content), as shown in the blue box in Fig. 2. Clicking the source name shows additional metadata for the annotation, along with the link to the corresponding record in the data source (the purple box in Fig. 2).

Glycoinformation in PubChem Protein and Gene

The Protein and Gene Summary pages show all available data in PubChem for a given gene or protein. For example, the following URLs are for the Summary pages for human β-1,4-galactosyltransferase 1 (B4GALT1) (NCBI Protein accession P15291) and its encoding gene (NCBI Gene ID 2683):

These pages contain a wide range of information on the protein and gene, including glycan-related annotations from GlyGen and GlyCosmos (see Table 1). The Protein Summary shows the glycosylation information (e.g., the site and type of glycosylation and the article from which the glycosylation data was extracted) (see Fig. 3), as shown in the following example for the B4GALT1 protein.

https://pubchem.ncbi.nlm.nih.gov/protein/P15291#section=Glycosylation

In the Gene Summary page, the gene-chemical and gene-disease interaction information, collected from GlyCosmos, is presented with interaction data from other sources (e.g., Comparative Toxicogenomics Database (CTD) [46] and DrugBank [47]), as shown in these examples:

Importantly, the Protein and Gene Summary pages provide quick access to bioactivity data of chemicals tested against the corresponding protein and gene targets, respectively. As an example, the following URLs lead to the bioactivity data for human B4GALT1 protein and its encoding gene:

These bioactivity data can be downloaded and used to perform a structure-activity relationship analysis or to build a predictive model for bioactivity against the target protein or gene.

Glycoinformation in PubChem Pathway

The Summary page for a given pathway lists chemicals, proteins, and genes involved in or associated with that pathway, along with information on interactions or reactions among them. As an example, the following URL directs users to the Pathway Summary page for the lysosomal oligosaccharide catabolism pathway (Reactome ID R-HSA-8853383).

https://pubchem.ncbi.nlm.nih.gov/pathway/Reactome:R-HSA-8853383

The Pathway Summary also shows a list of glycans, glycoproteins, and glycogenes (when available), integrated from GlyCosmos, as shown in these examples:

Glycoinformation in PubChem Taxonomy

The Taxonomy Summary page for a taxon can be accessed through a URL containing the corresponding NCBI Taxonomy ID. For example, the Taxonomy Summary page for Danio rerio (zebrafish; NCBI Taxonomy ID 7955) can be accessed via the URL:

https://pubchem.ncbi.nlm.nih.gov/taxonomy/7955

The wide range of information presented in the Taxonomy Summary includes the list of glycans found in the organism, the tissue or organ where the glycans were found, and the links to the articles from which the information was extracted. For example, the following URL leads to the list of glycans associated with zebrafish.

https://pubchem.ncbi.nlm.nih.gov/taxonomy/7955#section=Glycans

In addition, the Taxonomy Summary page shows the whole-organism bioassays performed on the whole taxon without a specific target gene or protein as well as the bioactivity data for the chemicals tested in those assays, as in the following examples:

Whole-organism bioassays for zebrafish

https://pubchem.ncbi.nlm.nih.gov/taxonomy/7955#section=Whole-Organism-BioAssays

Whole-organism bioactivities for zebrafish

https://pubchem.ncbi.nlm.nih.gov/taxonomy/7955#section=Whole-Organism-Bioactivities
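The Summary page URLs quoted in the preceding sections appear to follow a common pattern: a collection name, a record identifier, and an optional `#section=` anchor in which spaces in the section title are replaced by hyphens. The following minimal sketch builds such URLs. The pattern is inferred from the examples in this article rather than from a formal specification, and the helper name `summary_url` is hypothetical.

```python
PUBCHEM_BASE = "https://pubchem.ncbi.nlm.nih.gov"


def summary_url(collection, identifier, section=None):
    """Build a PubChem Summary page URL, optionally pointing at one section.

    collection: e.g. "compound", "protein", "gene", "pathway", "taxonomy"
    identifier: record identifier, e.g. 91846437, "P15291", 7955
    section:    section title as displayed; spaces become hyphens
                (an assumption based on the URLs shown in this article)
    """
    url = f"{PUBCHEM_BASE}/{collection}/{identifier}"
    if section:
        url += "#section=" + section.replace(" ", "-")
    return url
```

For example, `summary_url("taxonomy", 7955, "Glycans")` reproduces the zebrafish glycan list URL given above.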

Searching for glycan-related data in PubChem

Users can initiate a search of PubChem by providing a keyword query in the search box available on the PubChem home page (Fig. 4). PubChem accepts various types of keyword queries, including chemical names, Chemical Abstracts Service (CAS) registry numbers, PubChem record identifiers (SID, CID, and AID), gene/protein names and symbols, and disease names. To search for glycans and glycan-related records, a GlyTouCan accession may also be used as a query.

When a keyword query is provided, PubChem searches all data collections simultaneously and returns a list of hit records for each collection. PubChem also tries to identify the most relevant record and presents it at the top of the search result page. Clicking one of the hit records will direct the user to its Summary page. In addition, the search result page allows users to perform various tasks, such as downloading the hit records, filtering the hit list based on certain attributes, and combining hit lists from multiple queries. More detailed information on these tasks can be found in our previous papers [32, 33].

Getting glycan-related annotations from a source

PubChem users are often interested in getting all available annotations from a particular source, for example, all gene-disease associations from the GlyCosmos Glycoscience Portal. This can be readily done using the PubChem Data Sources page (https://pubchem.ncbi.nlm.nih.gov/source/) (see Fig. 5), which lists all PubChem data sources. This page allows users to search the sources by name and filter them based on the source category, status, and country as well as the type of data from the source. Clicking one of the sources in this list directs the user to a dedicated web page for that source, which provides information on the source as well as the link to the data originating from the source. Figure 5 shows how to download the gene-disease associations from the GlyCosmos Portal.

Programmatic access

Annotations to PubChem records can be programmatically accessed via PUG-View (https://pubchem.ncbi.nlm.nih.gov/docs/pug-view) [48], which is one of the REST-ful interfaces provided by PubChem. For example, the following PUG-View request URL downloads the annotations presented in the Biologic Description section of the Compound Summary page for CID 91846437.

https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/91846437/JSON?heading=Biologic+Description
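As a minimal sketch of how such a request might be issued from a script, the following Python code builds the request URL shown above and retrieves the JSON response using only the standard library. The function names `pug_view_record_url` and `fetch_json` are hypothetical helpers introduced here for illustration; the URL layout is taken from the example in this section.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PUG_VIEW_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view"


def pug_view_record_url(identifier, heading=None, record_type="compound", fmt="JSON"):
    """Build a PUG-View URL for one record, optionally limited to one heading."""
    url = f"{PUG_VIEW_BASE}/data/{record_type}/{identifier}/{fmt}"
    if heading:
        # urlencode turns spaces into '+', matching the example URL above.
        url += "?" + urlencode({"heading": heading})
    return url


def fetch_json(url, timeout=30):
    """Download a URL and parse the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_json(pug_view_record_url(91846437, "Biologic Description"))` would return the parsed annotations for CID 91846437 as nested Python dictionaries and lists.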

It is also possible to download a particular kind of annotation for all records from a given source. For example, all glycan classification annotations from GlyGen can be downloaded via the following URL:

https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/JSON/?source=GlyGen&heading_type=Compound&heading=GlyGen%20Classification&page=1&response_type=display

It is noteworthy that a PUG-View request returns up to 1000 annotations, although some headings have more annotations than this limit. Therefore, data from PUG-View requests are paginated. At the end of the returned output, the “TotalPages” and “Page” values are included to indicate the total pages of the data and the page number for the returned data. If the “TotalPages” value is greater than 1, there is more data to download. By default, the “Page” value is set to 1, meaning that the first page of the data is returned. To get another page of the data, the “Page” value should be adjusted accordingly. For example, the following request URL retrieves the second page of the glycan classification annotations from GlyGen:

https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/JSON/?source=GlyGen&heading_type=Compound&heading=GlyGen%20Classification&page=2&response_type=display
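The pagination procedure described above can be sketched as a loop that requests successive pages until the reported “TotalPages” value is reached. The code below is a hedged illustration, not an official client: the helper names are hypothetical, and it assumes that the returned JSON wraps the results in a top-level "Annotations" object holding an "Annotation" list alongside the "Page" and "TotalPages" values. The `fetch` parameter lets callers substitute their own download function (for example, one that adds a delay between requests).

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

ANNOTATIONS_URL = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/JSON/"
)


def heading_page_url(source, heading, heading_type="Compound", page=1):
    """Build the request URL for one page of annotations under a heading."""
    params = {
        "source": source,
        "heading_type": heading_type,
        "heading": heading,
        "page": page,
        "response_type": "display",
    }
    # quote_via=quote encodes spaces as %20, as in the example URLs above.
    return ANNOTATIONS_URL + "?" + urlencode(params, quote_via=quote)


def fetch_json(url, timeout=30):
    """Download a URL and parse the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def iter_annotations(source, heading, heading_type="Compound", fetch=fetch_json):
    """Yield annotations from every page, stopping at the last reported page."""
    page = 1
    while True:
        payload = fetch(heading_page_url(source, heading, heading_type, page))
        # Assumed response layout: {"Annotations": {"Annotation": [...],
        #                                           "Page": n, "TotalPages": m}}
        block = payload.get("Annotations", {})
        yield from block.get("Annotation", [])
        if page >= int(block.get("TotalPages", 1)):
            break
        page += 1
```

For instance, `list(iter_annotations("GlyGen", "GlyGen Classification"))` would collect the glycan classification annotations from all pages into one list.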

Accessing related glycans with different subsumption levels

The subsumption of a glycan record refers to the level of the structural details known for the glycan (https://glycosmos.org/help). There are five subsumption levels: base composition, monosaccharide composition, glycosidic topology, linkage-defined saccharide, and fully defined saccharide (in the order of increasing structural details) (see Table 2). Figure 6 shows how glycans with different subsumption levels are integrated into PubChem. It is noteworthy that some glycans have no CIDs (i.e., no corresponding records in the Compound collection). While all compound records in PubChem are required to have discrete chemical structures, glycans whose subsumption level is “base composition” or “monosaccharide composition” (e.g., G70264PJ and G71524FT in Fig. 6) do not have sufficient information on the glycosidic linkage (i.e., anomeric configuration and the carbon numbers of the monosaccharides which are linked together). Therefore, they cannot be added to PubChem Compound. On the other hand, the glycans in Fig. 6 whose subsumption level is “glycosidic topology” or “linkage-defined saccharide” have CIDs in PubChem, and they have the same atom connectivity with varying stereochemistry for the glycosidic linkage. In general, only glycans whose subsumption level is “glycosidic topology” or higher can have discrete structures and hence corresponding records in PubChem Compound.
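The ordering of the five subsumption levels, and the general rule that only “glycosidic topology” or higher can correspond to a Compound record, can be expressed as a small ordered enumeration. This is an illustrative sketch: the class and function names are hypothetical, and the numeric values only encode the order of increasing structural detail stated above.

```python
from enum import IntEnum


class SubsumptionLevel(IntEnum):
    """Glycan subsumption levels, ordered by increasing structural detail."""

    BASE_COMPOSITION = 1
    MONOSACCHARIDE_COMPOSITION = 2
    GLYCOSIDIC_TOPOLOGY = 3
    LINKAGE_DEFINED_SACCHARIDE = 4
    FULLY_DEFINED_SACCHARIDE = 5


def can_have_compound_record(level):
    """Return True if a glycan at this level can have a discrete structure."""
    return level >= SubsumptionLevel.GLYCOSIDIC_TOPOLOGY
```

Under this rule, a glycan at the `MONOSACCHARIDE_COMPOSITION` level is not eligible for a CID, whereas one at the `GLYCOSIDIC_TOPOLOGY` level is.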

Table 2 Definition of glycan subsumption levels


The Glycan Naming and Subsumption Ontology (GNOme) [49] is added to the PubChem Classification Browser, which allows users to retrieve PubChem records that belong to a certain class or have particular annotations.

https://pubchem.ncbi.nlm.nih.gov/classification/#hid=133

After selecting “GNOme” from the “Select classification” dropdown menu, users can navigate the GNOme classification tree to identify the node corresponding to a glycan or glycans with a desired monoisotopic mass value or range. This enables users to retrieve the glycans with the same mass but with different subsumption levels.

Limitations and future directions

As mentioned previously, only about 40% of the 220 thousand glycan substance records from GlyTouCan are associated with compound records. The remaining 60% do not have associated compound records, primarily because the substances have a variable (non-discrete or incomplete) aspect. Some glycans lack sufficient information on glycosidic linkages, and because Compound records do not allow such variability, these glycans cannot be associated with any compound (CID) record in PubChem. However, because these glycans may still carry useful information, it is necessary to provide a way for users to access information on them. Therefore, a dedicated web page is being considered for each of the glycans with non-discrete structures. A brief explanation of recent additions of Compound summary records for non-discrete chemical substances in PubChem is given in our recent publication [50].

Many chemical structures in the PubChem Compound database contain glycans. As of June 13, 2024, a total of 1,167,038 records in the PubChem Compound database contain a glycan monomer as determined by the Sugar & Splice software from NextMove Software (Additional File 1). However, there is no agreed-upon answer to the question “what is a glycan monomer?” Many of the unique glycan monomers within PubChem are synthetically created or synthetically modified from natural glycans. Of these approximately 1.17 million glycan-containing CIDs in PubChem Compound, only 142,974 (about 12%) could have a computed image and/or other computed properties using Sugar & Splice (Additional File 2), which is to say that they more closely resemble a glycan structure than a glycoconjugate or some other glycosylated chemical structure. This suggests that most of the glycan-containing structures in PubChem are part of a larger chemical structure (e.g., a glycoconjugate). The absence of a clear definition of “what is a glycan?” or “what is a naturally occurring glycan?” makes it difficult to assess the portion of PubChem that may be of interest to glycan researchers. It also highlights opportunities to assess parts of PubChem (or other chemical database resources) that contain useful glycan-related information that may assist researchers in making new discoveries, as well as opportunities for data exchange between chemical databases and glycan-focused data systems. While most information has flowed from glycan-focused resources to chemical databases, it is possible for this exchange to work in the reverse direction, once the scope of the glycan-containing structures can be assessed (e.g., which chemical structures are potentially of importance to researchers working with glycans).

Conclusions

The present paper describes two NCBI resources, the NCBI Glycans and PubChem, which provide glycan-related information useful for the glycoscience research community. The NCBI Glycans provides quick access to various glycoscience data content, including the two online books, the “Essentials of Glycobiology” and “Glycoscience Protocols,” and the SNFG official web page. It also allows users to identify glycan-related records in NCBI databases, such as PubMed, PubChem, Structure, Gene, Protein, MedGen, and Bookshelf. In addition, the NCBI Glycans provides links to popular glycoscience information resources external to NCBI.

As a large-scale public chemical database, PubChem contains a substantial number of glycans and related data, integrated from important glycoscience resources like GlyTouCan, GlyCosmos, GlyGen, and GlycoNAVI. The glycan-related data in PubChem are organized into multiple data collections (i.e., Substance, Compound, Protein, Gene, Pathway, and Taxonomy) and presented on the Summary page of relevant PubChem records. These records can be searched for using a keyword query (e.g., names or identifiers). The PubChem Classification Browser and Data Sources page are also useful tools to access glycan data in PubChem. It is also possible to programmatically access the glycan data using the PUG-View programmatic service. In the spirit of open data exchange, for all glycan data integrated from external data sources, PubChem provides links back to the corresponding record at the original source’s website (see Fig. 2). This allows users to check the accuracy of the data and find additional information at the data source.

The PubChem team has been actively working with the glycoscience community. In collaboration with the SNFG discussion group, it maintains the SNFG guideline page and the sialic acid nomenclature pages at the NCBI Glycans. As explained in our recent papers [51, 52], the PubChem team collaborates with the developers of other glycoscience databases, such as GlyTouCan, GlyGen, and GlyCosmos, to make glycan-related data comply with the FAIR guiding principles for scientific data management and stewardship [53], where FAIR is an acronym for “findable, accessible, interoperable, and reusable.” These efforts will improve the accessibility of glycan-related information, enabling scientists to exploit it in their research.

Data availability

All PubChem data are freely available to the public.

References

Smith BAH, Bertozzi CR. The clinical impact of glycobiology: targeting selectins, Siglecs and mammalian glycans. Nat Rev Drug Discov. 2021;20(3):217–43. https://doi.org/10.1038/s41573-020-00093-1.
Johannes L, Shafaq-Zadah M, Dransart E, Wunder C, Leffler H. Endocytic roles of glycans on proteins and lipids. Cold Spring Harb Perspect Biol. 2024;16(1):a041398. https://doi.org/10.1101/cshperspect.a041398.
Shkunnikova S, Mijakovac A, Sironic L, Hanic M, Lauc G, Kavur MM. IgG glycans in health and disease: prediction, intervention, prognosis, and therapy. Biotechnol Adv. 2023;67:108169. https://doi.org/10.1016/j.biotechadv.2023.108169.
Reggiori F, Gabius H-J, Aureli M, Römer W, Sonnino S, Eskelinen E-L. Glycans in autophagy, endocytosis and lysosomal functions. Glycoconj J. 2021;38(5):625–47. https://doi.org/10.1007/s10719-021-10007-x.
Kim Y, Hyun JY, Shin I. Multivalent glycans for biological and biomedical applications. Chem Soc Rev. 2021;50(18):10567–93. https://doi.org/10.1039/D0CS01606C.
Miller NL, Clark T, Raman R, Sasisekharan R. Glycans in virus-host interactions: a structural perspective. Front Mol Biosci. 2021;8:666756. https://doi.org/10.3389/fmolb.2021.666756.
Gao G, Li C, Fan W, Zhang M, Li X, Chen W, et al. Brilliant glycans and glycosylation: Seq and ye shall find. Int J Biol Macromol. 2021;189:279–91. https://doi.org/10.1016/j.ijbiomac.2021.08.054.
Purushothaman A, Mohajeri M, Lele TP. The role of glycans in the mechanobiology of cancer. J Biol Chem. 2023;299(3):102935. https://doi.org/10.1016/j.jbc.2023.102935.
Berois N, Pittini A, Osinaga E. Targeting tumor glycans for cancer therapy: successes, limitations, and perspectives. Cancers. 2022;14(3):645. https://doi.org/10.3390/cancers14030645.
Sun L, Zhang Y, Li W, Zhang J, Zhang Y. Mucin glycans: a target for cancer therapy. Molecules. 2023;28(20):7033. https://doi.org/10.3390/molecules28207033.
Chang IJ, He M, Lam CT. Congenital disorders of glycosylation. Ann Transl Med. 2018;6(24):477. https://doi.org/10.21037/atm.2018.10.45.
Freeze HH, Aebi M. Altered glycan structures: the molecular basis of congenital disorders of glycosylation. Curr Opin Struct Biol. 2005;15(5):490–8. https://doi.org/10.1016/j.sbi.2005.08.010.
Doubet S, Bock K, Smith D, Darvill A, Albersheim P. The complex carbohydrate structure database. Trends Biochem Sci. 1989;14(12):475–7. https://doi.org/10.1016/0968-0004(89)90175-8.
Konishi Y, Aoki-Kinoshita KF. The GlycomeAtlas tool for visualizing and querying glycome data. Bioinformatics. 2012;28(21):2849–50. https://doi.org/10.1093/bioinformatics/bts516.
Yamakawa N, Vanbeselaere J, Chang L-Y, Yu S-Y, Ducrocq L, Harduin-Lepers A, et al. Systems glycomics of adult zebrafish identifies organ-specific sialylation and glycosylation patterns. Nat Commun. 2018;9(1):4647. https://doi.org/10.1038/s41467-018-06950-3.
Toukach PV, Egorova KS. Carbohydrate structure database merged from bacterial, archaeal, plant and fungal parts. Nucleic Acids Res. 2016;44(D1):D1229–36. https://doi.org/10.1093/nar/gkv840.
Drula E, Garron ML, Dogan S, Lombard V, Henrissat B, Terrapon N. The carbohydrate-active enzyme database: functions and literature. Nucleic Acids Res. 2022;50(D1):D571–7. https://doi.org/10.1093/nar/gkab1045.
Okuda S, Nakao H, Kawasaki T. GlycoEpitope: database for carbohydrate antigen and antibody. In: Taniguchi N, Endo T, Hart GW, Seeberger PH, Wong C-H, editors. Glycoscience: biology and medicine. Tokyo: Springer Japan; 2015. pp. 267–73. https://doi.org/10.1007/978-4-431-54841-6_27.
Fujitani N, Furukawa J-I, Araki K, Fujioka T, Takegawa Y, Piao J, et al. Total cellular glycomics allows characterizing cells and streamlining the discovery process for cellular biomarkers. Proc Natl Acad Sci. 2013;110(6):2105–10. https://doi.org/10.1073/pnas.1214233110.
Kanehisa M. KEGG Glycan. In: Aoki-Kinoshita KF, editor. A Practical Guide to Using Glycomics Databases. Tokyo: Springer; 2017. pp. 177–93. https://doi.org/10.1007/978-4-431-56454-6_9.
Fujita A, Aoki NP, Shinmachi D, Matsubara M, Tsuchiya S, Shiota M, et al. The international glycan repository GlyTouCan version 3.0. Nucleic Acids Res. 2021;49(D1):D1529–33. https://doi.org/10.1093/nar/gkaa947.
Tiemeyer M, Aoki K, Paulson J, Cummings RD, York WS, Karlsson NG, et al. GlyTouCan: an accessible glycan structure repository. Glycobiology. 2017;27(10):915–9. https://doi.org/10.1093/glycob/cwx066.
Takahashi Y, Shiota M, Fujita A, Yamada I, Aoki-Kinoshita KF. GlyComb: a novel glycoconjugate data repository that bridges glycomics and proteomics. J Biol Chem. 2024;300(2):105624. https://doi.org/10.1016/j.jbc.2023.105624.
Watanabe Y, Aoki-Kinoshita KF, Ishihama Y, Okuda S. GlycoPOST realizes FAIR principles for glycomics mass spectrometry data. Nucleic Acids Res. 2020;49(D1):D1523–8. https://doi.org/10.1093/nar/gkaa1012.
Rojas-Macias MA, Mariethoz J, Andersson P, Jin C, Venkatakrishnan V, Aoki NP, et al. Towards a standardized bioinformatics infrastructure for N- and O-glycomics. Nat Commun. 2019;10(1):3275. https://doi.org/10.1038/s41467-019-11131-x.
York WS, Mazumder R, Ranzinger R, Edwards N, Kahsay R, Aoki-Kinoshita KF, et al. GlyGen: computational and informatics resources for Glycoscience. Glycobiology. 2020;30(2):72–3. https://doi.org/10.1093/glycob/cwz080.
Mariethoz J, Alocci D, Gastaldello A, Horlacher O, Gasteiger E, Rojas-Macias M, et al. Glycomics@ExPASy: bridging the gap. Mol Cell Proteomics. 2018;17(11):2164–76. https://doi.org/10.1074/mcp.RA118.000799.
Yamada I, Shiota M, Shinmachi D, Ono T, Tsuchiya S, Hosoda M, et al. The GlyCosmos Portal: a unified and comprehensive web resource for the glycosciences. Nat Methods. 2020;17(7):649–50. https://doi.org/10.1038/s41592-020-0879-8.
Lisacek F, Tiemeyer M, Mazumder R, Aoki-Kinoshita KF. Worldwide glycoscience informatics infrastructure: the GlySpace Alliance. JACS Au. 2022;3(1):4–12. https://doi.org/10.1021/jacsau.2c00477.
Aoki-Kinoshita KF, Lisacek F, Mazumder R, York WS, Packer NH. The GlySpace Alliance: toward a collaborative global glycoinformatics community. Glycobiology. 2020;30(2):70–1. https://doi.org/10.1093/glycob/cwz078.
Kim S, Chen J, Cheng TJ, Gindulyte A, He J, He SQ, et al. PubChem 2023 update. Nucleic Acids Res. 2023;51(D1):D1373–80. https://doi.org/10.1093/nar/gkac956.
Kim S. Exploring Chemical Information in PubChem. Current Protocols. 2021;1(8):e217. https://doi.org/10.1002/cpz1.217.
Kim S, Bolton EE. PubChem: a large-scale public chemical database for drug discovery. In: Open Access Databases and Datasets for Drug Discovery. 2024. pp. 39–66. https://doi.org/10.1002/9783527830497.ch2.
Sayers EW, Beck J, Bolton EE, Brister JR, Chan J, Comeau DC, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2024;52(D1):D33–43. https://doi.org/10.1093/nar/gkad1044.
Kim S, Cheng TJ, He SQ, Thiessen PA, Li QL, Gindulyte A, et al. PubChem Protein, Gene, Pathway, and Taxonomy data collections: bridging biology and chemistry through target-centric views of PubChem data. J Mol Biol. 2022;434(11):167514. https://doi.org/10.1016/j.jmb.2022.167514.
Matsubara M, Aoki-Kinoshita KF, Aoki NP, Yamada I, Narimatsu H. WURCS 2.0 update to encapsulate ambiguous carbohydrate structures. J Chem Inf Model. 2017;57(4):632–7. https://doi.org/10.1021/acs.jcim.6b00650.
Tanaka K, Aoki-Kinoshita KF, Kotera M, Sawaki H, Tsuchiya S, Fujita N, et al. WURCS: the Web3 Unique Representation of Carbohydrate Structures. J Chem Inf Model. 2014;54(6):1558–66. https://doi.org/10.1021/ci400571e.
Hähnke VD, Kim S, Bolton EE. PubChem chemical structure standardization. J Cheminform. 2018;10:36. https://doi.org/10.1186/s13321-018-0293-8.
Varki A, Cummings RD, Esko JD, Stanley P, Hart GW, Aebi M, et al. Essentials of glycobiology [Internet]. Cold Spring Harbor (NY): Cold Spring Harbor Laboratory Press; 2022. Available from: https://www.ncbi.nlm.nih.gov/books/NBK579918/.
Nishihara S, Angata K, Aoki-Kinoshita KF, Hirabayashi J. Glycoscience Protocols (GlycoPODv2). Saitama (Japan): Japan Consortium for Glycobiology and Glycotechnology; 2021. Available from: https://www.ncbi.nlm.nih.gov/books/NBK593839/.
Varki A, Cummings RD, Aebi M, Packer NH, Seeberger PH, Esko JD, et al. Symbol nomenclature for graphical representations of glycans. Glycobiology. 2015;25(12):1323–4. https://doi.org/10.1093/glycob/cwv091.
Neelamegham S, Aoki-Kinoshita K, Bolton E, Frank M, Lisacek F, Lütteke T, et al. Updates to the symbol nomenclature for glycans guidelines. Glycobiology. 2019;29(9):620–4. https://doi.org/10.1093/glycob/cwz045.
Lewis AL, Toukach P, Bolton E, Chen X, Frank M, Lütteke T, et al. Cataloging natural sialic acids and other nonulosonic acids (NulOs), and their representation using the Symbol Nomenclature for Glycans. Glycobiology. 2023;33(2):99–103. https://doi.org/10.1093/glycob/cwac072.
Bohne-Lang A, Lang E, Förster T, von der Lieth CW. LINUCS: LInear Notation for Unique Description of Carbohydrate Sequences. Carbohydr Res. 2001;336(1):1–11. https://doi.org/10.1016/s0008-6215(01)00230-0.
McNaught AD. Nomenclature of carbohydrates (IUPAC Recommendations 1996). Pure Appl Chem. 1996;68(10):1919–2008. https://doi.org/10.1351/pac199668101919.
Davis AP, Wiegers TC, Johnson RJ, Sciaky D, Wiegers J, Mattingly CJ. Comparative Toxicogenomics Database (CTD): update 2023. Nucleic Acids Res. 2023;51(D1):D1257–62. https://doi.org/10.1093/nar/gkac833.
Knox C, Wilson M, Klinger CM, Franklin M, Oler E, Wilson A, et al. DrugBank 6.0: the DrugBank Knowledgebase for 2024. Nucleic Acids Res. 2024;52(D1):D1265–75. https://doi.org/10.1093/nar/gkad976.
Kim S, Thiessen PA, Cheng TJ, Zhang J, Gindulyte A, Bolton EE. PUG-View: programmatic access to chemical annotations integrated in PubChem. J Cheminform. 2019;11:56. https://doi.org/10.1186/s13321-019-0375-2.
Zhang W, Edwards NJ. GNOme – Glycan Naming and Subsumption Ontology. In: Hastings J, Barton A, editors. International Conference on Biomedical Ontologies (ICBO) 2021; September 16–18, 2021; Bozen-Bolzano, Italy: CEUR Workshop Proceedings; 2021. pp. 89–93. https://ceur-ws.org/Vol-3073/paper11.pdf.
Kim S, Yu B, Li Q, Bolton EE. PubChem synonym filtering process using crowdsourcing. J Cheminform. 2024;16:69. https://doi.org/10.1186/s13321-024-00868-3.
Cheng TJ, Ono T, Shiota M, Yamada I, Aoki-Kinoshita KF, Bolton EE. Bridging glycoinformatics and cheminformatics: integration efforts between GlyCosmos and PubChem. Glycobiology. 2023;33(6):454–63. https://doi.org/10.1093/glycob/cwad028.
Navelkar R, Owen G, Mutherkrishnan V, Thiessen P, Cheng T, Bolton E, et al. Enhancing the interoperability of glycan data flow between ChEBI, PubChem and GlyGen. Glycobiology. 2021;31(11):1510–9. https://doi.org/10.1093/glycob/cwab078.
Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3(1):160018. https://doi.org/10.1038/sdata.2016.18.


Acknowledgements

We appreciate our fruitful collaboration with NextMove Software, the SNFG discussion group, and the Glycan page advisory group. We also thank the PubChem data contributors and the glycoscience community, especially the GlySpace Alliance members.

Funding

Open access funding provided by the National Institutes of Health. This work was supported by the National Center for Biotechnology Information of the National Library of Medicine (NLM), National Institutes of Health.

Author information

Authors and Affiliations

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Sunghwan Kim, Jian Zhang, Tiejun Cheng, Qingliang Li & Evan E. Bolton

Authors

Sunghwan Kim
Jian Zhang
Tiejun Cheng
Qingliang Li
Evan E. Bolton

Contributions

SK drafted the manuscript and reviewed the glycan data, tools, and services mentioned in the paper. JZ maintained the NCBI Glycans website. TC integrated glycan data from external resources into PubChem. QL developed a dedicated web page for non-discrete chemical structures. EB conceived and supervised the project. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Evan E. Bolton.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Source of biological material

Not applicable.

Statement of animal welfare

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Published in the topical collection featuring Current Progress in Glycosciences and Glycobioinformatics with guest editors Joseph Zaia and Kiyoko F. Aoki-Kinoshita.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary File 1

A list of CIDs of the 1,167,038 compounds whose structure contains a glycan monomer unit, detected by the Sugar & Splice toolkit from NextMove Software (GZ 2.77 MB)

Supplementary File 2

A list of CIDs of the 142,974 compounds for which the Sugar & Splice toolkit could generate a biologic image or compute at least one of the biologic properties (LINUCS, IUPAC, and IUPAC-condensed) (GZ 3.14 MB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Kim, S., Zhang, J., Cheng, T. et al. Glycoscience data content in the NCBI Glycans and PubChem. Anal Bioanal Chem (2024). https://doi.org/10.1007/s00216-024-05459-7


Received: 15 June 2024
Revised: 11 July 2024
Accepted: 15 July 2024
Published: 12 August 2024
DOI: https://doi.org/10.1007/s00216-024-05459-7
