Cellular and Molecular Life Sciences

, Volume 67, Issue 16, pp 2749–2772

Bioinformatics and molecular modeling in glycobiology

Open AccessReview

DOI: 10.1007/s00018-010-0352-4

Cite this article as:
Frank, M. & Schloissnig, S. Cell. Mol. Life Sci. (2010) 67: 2749. doi:10.1007/s00018-010-0352-4

Abstract

The field of glycobiology is concerned with the study of the structure, properties, and biological functions of the family of biomolecules called carbohydrates. Bioinformatics for glycobiology is a particularly challenging field, because carbohydrates exhibit a high structural diversity and their chains are often branched. Significant improvements in experimental analytical methods over recent years have led to a tremendous increase in the amount of carbohydrate structure data generated. Consequently, the availability of databases and tools to store, retrieve and analyze these data in an efficient way is of fundamental importance to progress in glycobiology. In this review, the various graphical representations and sequence formats of carbohydrates are introduced, and an overview of newly developed databases, the latest developments in sequence alignment and data mining, and tools to support experimental glycan analysis are presented. Finally, the field of structural glycoinformatics and molecular modeling of carbohydrates, glycoproteins, and protein–carbohydrate interaction are reviewed.

Keywords

Glyco-bioinformaticsDatabasesCarbohydratesGlycosylationGlycoproteinsMolecular modelingMolecular dynamics simulation

Introduction

The field of glycobiology is concerned with the study of the structures, properties, and biological functions of the family of biomolecules called carbohydrates. These carbohydrates can differ significantly in size ranging from monosaccharides to polysaccharides consisting of many thousands of carbohydrate units. One of the most significant features of carbohydrates is their ability to form branched molecules, which stands in contrast to the linear nature of DNA, RNA, and proteins. Combined with the large heterogeneity of their basic building blocks, the monosaccharides, they exhibit a significantly higher structural diversity than other abundant macromolecules. On the cell surface, carbohydrates (glycans) occur frequently as glycoconjugates, where they are covalently attached to proteins and lipids (aglycons). Glycosylation constitutes the most prevalent of all known post-translational protein modifications. It has been estimated that more than half the proteins in nature are glycoproteins [1]. Carbohydrates (N- or O-glycans) are typically connected to proteins via asparagine (N-linked glycosylation), serine or threonine (O-linked glycosylation). In recent years, it has been shown that glycosylation plays a key role in health and disease and consequently it has gained significant attention in life science research and industry [210].

Databases are playing a significant role in modern life science. It is now unthinkable to design research projects without a prior query or consultation of a few databases. In this respect, bioinformatics provides databases and tools to support glycobiologists in their research. Additionally, high throughput analysis of glycomes can only be handled properly with some sort of automated analysis pipeline that requires extensive bioinformatic support to organize the experimental data generated. In parallel, there are bioinformatic groups actively developing mathematical or statistical algorithms and computational methodologies to analyze the data and thus uncover biological knowledge underlying the biological data. Since valuable experimental data are generated at various locations and in projects that target different scientific questions, the different sources of data have to be connected to generate a more complete data repository, which may aid in gaining a clearer understanding of the functions of carbohydrates in an organism. Consequently, data integration is a prerequisite for improving the efficiency of extraction and analysis of biological information, particularly for knowledge discovery and research planning [11].

Significant improvements in experimental analytical methods over recent years—particularly in glycan analysis by mass spectrometry and high performance separation techniques [1221]—have led to a tremendous increase in the amount of carbohydrate structure data generated. The second source generating new experimental data on a large scale is the increased application of lectin and carbohydrate microarrays to probe the binding preferences of carbohydrates to proteins [2226]. Consequently, the availability of databases and tools to store, retrieve, and analyze these data in an efficient way is of fundamental importance to progress in glycobiology [13, 15, 18, 27, 28]. Although bioinformatics for glycobiology or glycomics (‘glycoinformatics’) [29] is not yet as well established as in the fields of genomics and proteomics [30, 31], over the past few years, there has been a substantial increase in both the development, and use, of informatics tools and databases in glycosciences [3242].

In this review, we will first introduce the various representations of carbohydrates used in the literature, then provide an overview of newly developed databases for glycomics, highlighting briefly the most recent glycoinformatic developments in sequence alignment and data mining, and provide an update [38] on tools to support experimental glycan analysis. Finally, we will review the field of (3D) structural glycoinformatics and molecular modeling of carbohydrates, glycoproteins, and protein–carbohydrate interaction.

Graphical representations of carbohydrate structures

The basic units of carbohydrates are the monosaccharides. Whereas the other fundamental building blocks of biological macromolecules (nucleotides and amino acids) are clearly defined and limited in their number, the situation is much more complex for the carbohydrates. This becomes immediately evident by looking at the ten most frequently occurring monosaccharides in mammals [43]: d-GlcNAc, d-Gal, d-Man, d-Neu5Ac, l-Fuc, d-GalNAc, d-Glc, d-GlcA, d-Xyl, and l-IdoA (Fig. 1). Less than half of them are unmodified hexoses (d-Glc, d-Gal, d-Man) or pentoses (d-Xyl). Most of them are modified or substituted on the parent monosaccharides (deoxy: l-Fuc; acidic: d-GlcA, l-IdoA; substituted: d-GlcNAc, d-GalNAc, d-Neu5Ac). Therefore, derivatization is for monosaccharides the rule rather than an exception. The d-form is more common, but some monosaccharides occur more frequently in their l-form. Additionally, each of them can occur in two anomeric forms (α/β) and two ring forms [pyranose (p)/furanose (f)], which results, for example, in eight forms of cyclic galactose (α-d-Galp, α-l-Galp, β-d-Galp, β-d-Galf, etc.).
https://static-content.springer.com/image/art%3A10.1007%2Fs00018-010-0352-4/MediaObjects/18_2010_352_Fig1_HTML.gif
Fig. 1

Frequently occurring carbohydrate building blocks in mammalia. For each monosaccharide, the 3D structure, the IUPAC short code, and the symbols used in the Oxford (top) and CFG (bottom) symbolic nomenclature are shown. The acids (last row) are displayed with substituents (O-methyl, sulfate). α-l-IdopA is shown in two conformations (1C4 and 2S0) as they appear in heparin (pdb code 1E0O [53]). The formal charge of the monosaccharide at physiological pH is denoted in parentheses

Simplified representations of complex biological macromolecules are frequently used to communicate or encode information on their structure. One-letter codes are in use to encode nucleic acids (5 nucleotides) or proteins (20 amino acids). Since the number of basic carbohydrate units frequently found in mammals is also very limited, symbolic representations [44, 45] (Figs. 1, 2) are frequently in use and one-letter codes have also been proposed [46, 47]. However, more than 100 different monosaccharides are found in bacteria, as has been revealed by a statistical database analysis [48]. This renders the general representation of monosaccharides by one-letter codes unfeasible, and generally longer abbreviations for the monosaccharide residues have to be used. A standardized International Union of Pure and Applied Chemistry (IUPAC) nomenclature for monosaccharides and oligosaccharide chains exists (http://www.chem.qmul.ac.uk/iupac/2carb/) [49], and full names and short codes for the common monosaccharides and derivatives have been defined (e.g., ‘Glc’ for glucose, ‘GlcNAc’ for N-acetylglucosamine). Typically the short names are derived from the trivial names of the monosaccharides (e.g. ‘Fuc’, systematic name: 6-deoxy-galactopyranose; trivial name: fucose). As already shown, in order to define the full monosaccharide short names, the anomeric descriptor, the d/l identifier and the ring form (p/f) have to be given as well, so the shortest name for ‘α-d-glucopyranose’ would be ‘α-d-Glcp’. An example of a more complex monosaccharide is N-acetyl-α-neuraminic acid; short name: α-d-Neup5Ac or α-Neu5Ac (full IUPAC name: 5-acetamido-3,5-dideoxy-d-glycero-α-d-galacto-non-2-ulopyranosonic acid).
https://static-content.springer.com/image/art%3A10.1007%2Fs00018-010-0352-4/MediaObjects/18_2010_352_Fig2_HTML.gif
Fig. 2

Different graphical representations of the N-glycan GlcNAc2Man9. a CFG symbolic representation [45]. b Oxford system [44]. c Chemical drawing. d Extended IUPAC 2D graph representation

The most commonly used graphical and textual representations for carbohydrates are shown in Fig. 2. Each of these shows a different level of information content that is tailored to a particular area of glycoscience research. Glycobiologists will prefer cartoon representations (Fig. 2a, b), whereas synthetic chemists will prefer the ‘chemical’ structural drawings with full atom topology displayed (Fig. 2c). Unfortunately, there are different graphical symbols in use for the same monosaccharides, which is even confusing for scientists working in the field. An agreement on one set of symbols would be very beneficial for the community [50]. From the viewpoint of bioinformatics, the graphical representations are only relevant for structure display in the context of user interfaces. Software tools have been developed that are able to generate on-the-fly cartoon representations from a carbohydrate sequence format (which is a ‘computer representation’ of a carbohydrate structure) [51]. Although carbohydrates have been encoded successfully using the ‘computerized’ extended IUPAC (2D) representation [52] (Fig. 2d), it has been realized that a more flexible and systematic sequence format is required to encode all carbohydrate structures that occur in scientific publications.

Bioinformatic concepts and algorithms

Encoding of carbohydrate structures

There are essentially two possible ways to encode a carbohydrate molecule: as a set of atoms that are connected through chemical bonds, or as a set of building blocks that are connected through linkages. The first approach is used in chemoinformatics and a variety of chemical file formats (e.g., CML [54], InChi [55], SMILES [56]) have been developed for encoding of molecules for storage in chemical databases like PubChem [57] or ChEBI [58]. Figure 3 shows one PubChem entry of sialyl Lewis-X together with additional structural descriptors, like molecular weight. IUPAC full names and InChi and SMILES encoding are computed from the chemical drawing; therefore, encoding of carbohydrates as InChi or auto-generated IUPAC names is possible; however, there are severe limitations. One of the main requirements for databases is to store information in an organized way that facilitates further computational processing. Based on the InChi or IUPAC code, it is, for example, very difficult to derive the monosaccharide composition of sialyl Lewis-X, which is (Neu5Ac)1(Gal)1(GlcNAc)1(Fuc)1. Additionally, since there is more than one entry for sialyl Lewis-X in the PubChem database, the automatic procedures to always generate the same InChi code for the same carbohydrate may need to be improved in order to generate unique IDs for carbohydrates. Although it is possible to develop software that parses InChi codes and assigns knowledge about monosaccharides to a database entry, InChi may not be the first choice for the encoding of carbohydrate structures. Consequently, it would be much more efficient to encode carbohydrates using a residue-based approach similar to the sequences of genes and proteins. However, there are two significant differences: the number of building blocks (residues) may be very large due to frequently occurring modifications of the parent monosaccharides, and the carbohydrate chains frequently contain branches, which means that many carbohydrates are tree-like molecules.
https://static-content.springer.com/image/art%3A10.1007%2Fs00018-010-0352-4/MediaObjects/18_2010_352_Fig3_HTML.gif
Fig. 3

PubChem entry of sialyl Lewis-X together with structural descriptors like IUPAC name and InChi and SMILES code

The prerequisite for a residue-based encoding format is a controlled vocabulary of the residue names. For practical reasons, the number of residues should be kept as small as possible. The main difficulty in encoding monosaccharide names in a systematic way is the definition of clear rules about which atoms of a molecule belong to a monosaccharide residue and which are, for example, of type ‘non-monosaccharide’ (e.g., substituent). The following short list will illustrate the dilemma: Glc, Gal, GlcN, GlcNAc, GalNAc, GlcOAc. For a biologist, all of them would be monosaccharides on their own (glucose, galactose, glucosamine, N-acetylglucosamine, etc.) except GlcOAc, which would be a glucose carrying an ‘acetyl’ substituent. From an encoding point of view, it makes more sense that all entries are of type ‘Glc’ or ‘Gal’ and N, NAc, and OAc are substituents, respectively. Even for bacterial monosaccharides, this would result in a reasonably sized ‘monosaccharide base type’ residue list {Glc, Gal, …} and ‘non-monosaccharide’ residue list {N, NAc, OAc, …}, which is much easier to maintain.

Over the years, each database project that aimed to store carbohydrate structures developed a new sequence format (see Table 1). Because of this, translation tools for the different formats were necessary to establish cross-links between the databases. In the context of the EUROCarbDB design study, which aimed at developing standards for glycoinformatics, a new carbohydrate sequence format for use in databases (GlycoCT) [59] has been established as well as a reference database for monosaccharide notation (http://www.monosaccharidedb.org). The GlycomeDB [60] project uses GlycoCT as a standard format for the integration of all open access carbohydrate structure databases (Fig. 4). Recently, as a result of a close collaboration between developers at the Complex Carbohydrate Research Center (CCRC) and EUROCarbDB, the Glyde-II format (http://glycomics.ccrc.uga.edu/GLYDE-II/) was created where concepts of Glyde [61] and GlycoCT [59] were combined in order to define a new standard exchange format for carbohydrate structures [27].
Table 1

Major carbohydrate structure databases and the sequence formats used

Database

Encoding

URL

GlycomeDB [89]

GlycoCT [59]

http://www.glycome-db.org/

EUROCarbDBa

GlycoCT [59]

http://www.ebi.ac.uk/eurocarb/

CarbBanka [52]

IUPAC extended [90]

http://www.boc.chem.uu.nl/sugabase/carbbank.html

KEGGa [83]

KCF [91]

http://www.genome.jp/kegg/glycan/

GLYCOSCIENCES.dea [82]

LINUCS [92]

http://www.glycosciences.de/

CFGa [84]

Glycominds Linear Code® [47]

http://www.functionalglycomics.org/

BCSDBa [93]

BCSDB linear code

http://www.glyco.ac.ru/bcsdb3/

GlycoSuiteDB [87]

IUPAC condensed [94]

http://glycosuitedb.expasy.org/

GlycoBase (Dublin)a [86]

Motif based

http://glycobase.ucd.ie/

GlycoBase (Lille)a [95]

Linkage path

http://glycobase.univ-lille1.fr/base/

JCGGDB [96]

CabosML [97]

http://jcggdb.jp/

aCurrently queried by GlycomeDB

https://static-content.springer.com/image/art%3A10.1007%2Fs00018-010-0352-4/MediaObjects/18_2010_352_Fig4_HTML.gif
Fig. 4

GlycomeDB entry of sialyl Lewis-X. The structure is displayed in a ‘human readable’ encoding (CFG cartoons) at the top and in a ‘computer readable’ sequence format (GlycoCT) at the bottom. Additionally, links to external databases and information on structural motifs are available

Algorithms for structural alignment and similarity of carbohydrate structures

For many applications in glycoinformatics, it is required to classify glycans based on occurring structural motifs or patterns, or to compare two carbohydrates and to establish a similarity score. Despite many efforts, this is still an open problem due to the lack of broadly accepted metrics on carbohydrate structures. Algorithms based on adapting the already established sequence alignment approaches from DNA, RNA, and protein sequences to carbohydrates and establishing a scoring matrix for substitutions have been proposed [62, 63]. However, discovery of biomarkers or more broadly extracting discriminating patterns from sets of carbohydrates poses a great challenge due to their branched nature and the possibility that a significant pattern can be fragmented and distributed across multiple branches of the carbohydrate. Traditional pattern discovery and classification techniques from machine learning have been applied with increasing success to meet this challenge. Markov models were used to discover patterns spanning multiple branches [62, 63]. Then the focus shifted to Support Vector Machines and the search for kernels appropriate for branched structures [6466]. Recently, a novel statistical approach for motif discovery that currently outperforms all competing methods has been presented [67].

Glycomics ontologies

The information explosion in biology makes it difficult for researchers to stay up-to-date with current biomedical knowledge and to make sense of the massive amounts of online information. Ontologies are increasingly enabling biomedical researchers to accomplish these tasks [68]. An ontology provides a shared vocabulary, which can be used to model a domain of interest, and it defines the type of objects and concepts that exist, and their properties and relations. Ontologies are often represented graphically as a hierarchical structure of concepts (nodes) that are connected by their relationships (edges). Concepts and relationships are assigned unique ontological names. For example, the ontological concepts ‘carbohydrate’ and ‘molecule’ can be connected by the hierarchical relationship ‘is_a’. In this way, ontologies can be used to provide a formal mechanism to categorize objects by specifying their membership in a specific class. The Glycomics Ontology (GlycO) focuses on the glycoproteomics domain to model the structure and functions of glycans and glycoconjugates, the enzymes involved in their biosynthesis and modification, and the metabolic pathways in which they participate. GlycO is intended to provide both a schema and a sufficiently large knowledge base, which will allow classification of concepts commonly encountered in the field of glycobiology in order to facilitate automated information analysis in this domain [69, 70] (for more information, see the web site http://lsdis.cs.uga.edu/projects/glycomics/).

Predicting the size and diversity of glycomes

A hexasaccharide can, in theory, build 1012 structural isomers [71]. Although such large numbers highlight the intrinsic complexity of carbohydrates, the actual number of carbohydrates in nature is probably significantly smaller. Modeling of enzyme kinetics and estimating the size and diversity of the glycome has been of considerable interest to the field of glycobiology. The first attempt to establish mathematical models of N-type glycosylation occurred over 10 years ago [72], by employing the known enzymatic activities of glycosyltransferases involved in the N-type glycosylation pathway and generating all possible carbohydrates resulting from those activities up to the first galactosylation of an oligosaccharide. This work was then extended [73] by using less simplifying assumptions and extending the set of enzymes included in the model. Recent advances have led to the ability to estimate enzyme reaction rates and enzyme concentrations from mass spectrometry data, thereby opening up the possibility to infer changes to the enzyme concentrations in diseased tissues [74]. Similarly, the expression profiles of glycosyltransferases were used to predict the repertoire of potential glycan structures [75, 76]. A recent estimation of the size of the human glycome based on biosynthetic pathway knowledge approximates the upper limit of distinct carbohydrates to be in the range of hundreds of thousands [77]. Currently, there are about 35,000 distinct carbohydrate structures stored in databases [60]; however, nobody knows how many carbohydrate structures have already been discovered or published so far, since no comprehensive database of carbohydrate structures exists. Statistical analyses of the structures stored in GLYCOSCIENCES.de and the Bacterial Carbohydrate Structure Database (BCSDB) have been performed and showed the differences between available structures from the mammalian and bacterial glycome [48]. The ‘chemical analysis’ of the database entries revealed that about 36 chemical building blocks (=monosaccharide + branching level) would be required for the chemical synthesis of 75% of the mammalian glycans [43]. Unfortunately, the situation is much more complex for the bacterial carbohydrates. However, if one would concentrate on the synthesis of a particular subclass of bacterial carbohydrates (e.g., for vaccine development) the diversity becomes much smaller.

Databases and tools for glycobiology

A variety of databases are available to the glycoscientist [32, 36, 7880]. From the viewpoint of glycoproteins, they can be grouped into databases that contain information on the proteins themselves, databases that store information on the enzymes and pathways that build the glycans, and carbohydrate structure databases [36, 37]. Only limited information is available in databases on glycoforms of glycoproteins.

Carbohydrate structure databases

The complex carbohydrate structure database (CCSD)—often referred to as CarbBank in reference to its query software—was developed and maintained for more than 10 years by the Complex Carbohydrate Research Center of the University of Georgia (USA) [52, 81]. The CCSD was the largest effort during the 1990s to collect the structures of carbohydrates, mainly through retrospective manual extraction from literature. The main aim of the CCSD was to catalog all publications in which complex carbohydrate structures were reported. Unfortunately, funding for the CCSD stopped during the second half of the 1990s and the database was no longer updated. Nevertheless, with almost 50,000 records (from 14,000 publications) relating to approximately 23,500 different carbohydrate sequences, the CCSD is still one of the largest repositories of carbohydrate-related data. Subsets of the CCSD data have been incorporated into almost all recent open access databases, of which the major ones are GLYCOSCIENCES.de (23,233) [82], KEGG Glycan database (10,969) [83], CFG Glycan Database (8,626) [84], Bacterial Carbohydrate Structure Database (BCSDB, 6,789) [85], and GlycoBase (377) [86]. The numbers in parentheses denote the number of distinct carbohydrate sequences (without aglycons) stored in the database (based on GlycomeDB analysis, October 2008). Recently, the JCGGDB, which assembles CabosDB, Galaxy, LipidBank, GlycoEpitope, LfDB, and SGCAL (with 1,490 unique carbohydrate structures) and the GlycoSuiteDB (with about 3,300 unique carbohydrate structures) [87] have become freely accessible as well.

The EUROCarbDB project (http://www.eurocarbdb.org) was a design study that aimed to create the foundations for a new infrastructure of distributed databases and bioinformatics tools where scientists themselves can upload carbohydrate structure-related data. Fundamental ethics of the project were that all data are freely accessible and all provided tools are open source. A prototype of a database application has been developed that can store carbohydrate structures plus additional data such as biological context (organism, tissue, disease, etc.), and literature references. Primary experimental data (MS, HPLC, and NMR) that can serve as evidence or reference data for the carbohydrate structure in question can be uploaded as well (Fig. 5).
https://static-content.springer.com/image/art%3A10.1007%2Fs00018-010-0352-4/MediaObjects/18_2010_352_Fig5_HTML.gif
Fig. 5

EUROCarbDB web interface. a The GlycanBuilder Tool [51] serves as an interface for structure input. Various graphical representations are supported and can be changed interactively. b Result of a structure search in the database. The first three entries contain associated MS data as evidence for the structure

Until recently, there was hardly any direct cross-linking between the established carbohydrate databases [88]. This is mainly due to the fact that the various databases use different sequence formats to encode carbohydrate structures [59] (Table 1). Therefore, the situation in glycoinformatics has been characterized by the existence of multiple disconnected and incompatible islands of experimental data, data resources, and specific applications, managed by various consortia, institutions, or local groups [27, 37]. Importantly, no comprehensive and curated database of carbohydrate structures currently exists. From the user’s point of view, the lack of cross-links between carbohydrate databases means that, until recently, they had to visit different database web portals in order to retrieve all the available information on a specific carbohydrate structure. Additionally, the users might have had to acquaint themselves with the different local query options, some of which require knowledge of the encoding of the residues in the respective database.

In 2005, a new initiative was begun to overcome the isolation of the carbohydrate structure databases and to create a comprehensive index of all available structures with references back to the original databases. To achieve this goal, most structures of the freely available databases were translated to the GlycoCT sequence format [59], and stored in a new database, the GlycomeDB [60]. The integration process is performed incrementally on a weekly basis, updating the GlycomeDB with the newest structures available in the associated databases. During the integration process, some automated checks are performed. Structures that contain errors are reported to the administrators of the original database. A web interface has been developed (http://www.glycome-db.org) as a single query point for all open access carbohydrate structure databases [89] (Fig. 4).

Databases for carbohydrate–protein interaction data

Advances in recent years have led to an explosive growth of data from carbohydrate microarray experiments, coming from multiple research laboratories, each employing their own proprietary technology for spotting the array [22, 98101]. Unfortunately, the often radically different approaches to the spotting of the arrays can change the binding affinities observed. This inhomogeneity in the way the data are generated has caused problems in the comparative analysis and evaluation of the data. Further complications arise from lack of comprehensive applications to manage the data generated in the experiments and also computational approaches to perform analytical studies. There is a clear need in this area for more research and development of computational approaches and tools. Particularly, the standards for reporting glycan array experiments need to be defined [102].

Currently, there is one major public resource for glycan array data provided by the Consortium for Functional Glycomics (CFG) (http://www.functionalglycomics.org). The CFG supported the development of the first robot-produced, publicly available micro-titer-based glycan array. The currently used printed mammalian glycan microarray format (version 4.1) comprises 465 synthetic and natural glycan sequences representing major glycan structures of glycoproteins and glycolipids [103]. In 2008, a pathogen glycan array was also made available for screening, containing 96 polysaccharides derived from Gram-negative bacteria. The protein–glycan interaction core (H) analyzes investigator-generated lectins, antibodies, antisera, microorganisms, or suspected glycan binding proteins (GBP) of human, animal, and microbial origins on the mammalian and pathogen glycan microarrays. Fluorescent reagents are used for detecting primary binding to the glycans on the array. The results of the screening performed by the Core H can be accessed through the consortium web page, and the raw data can be downloaded as an Excel spreadsheet [84] (Fig. 6). The website offers an interactive bar-chart that dynamically displays the glycan structures upon a mouse click on the signal of interest. Links to CFG databases that contain curated information on the GBP and the glycan structures printed on the array are available.
https://static-content.springer.com/image/art%3A10.1007%2Fs00018-010-0352-4/MediaObjects/18_2010_352_Fig6_HTML.gif
Fig. 6

Glycan microarray data provided by the Consortium for Functional Glycomics (CFG). a Web interface providing access to the primary data and related information. b The CFG dataset in the GlyAffinity database. Affinity data from different techniques (FAC, SPR, FP) are also available

A second large resource of carbohydrate–protein interaction data is the Lectin Frontier DataBase (LfDB) provided by the Japanese Consortium for Glycobiology and Glycotechnology (JCGG). A significant part of the data may have been generated as part of the structural glycomics project funded by New Energy and Industrial Technology Organization (NEDO) [104]. In contrast to the CFG glycan microarray database, which provides relative fluorescence units (FU), the LfDB provides affinity constants (Ka) determined by frontal affinity chromatography [24, 105]. Similar to the CFG website, users can navigate to the experimental data using the lectin as an entry point. The data are also presented as an interactive bar-chart (Fig. 7). Since the site is to a large extent in Japanese, it is somewhat difficult to explore the full functionality of the web-interface at the moment.
https://static-content.springer.com/image/art%3A10.1007%2Fs00018-010-0352-4/MediaObjects/18_2010_352_Fig7_HTML.gif
Fig. 7

Lectin Frontier Database. a LfDB web interface providing access to the experimental data and related information. b The LfDB dataset in the GlyAffinity database. Crosslinks via a carbohydrate entry to additional experiments and literature references are shown

Recently, a prototype for an integrated database for protein–carbohydrate interaction, ‘GlyAffinity’, has been developed at the German Cancer Research Center. The database aims at providing a comprehensive repository of curated protein–carbohydrate interaction data from various sources and techniques (Figs. 6b and 7b). The publicly available microarray experiments conducted by the CFG have been acquired and processed. This entailed the parsing of the data, conversion of the carbohydrate structures to the GlycoCT format, and curation of an initial set of approximately 100 experiments for inclusion in the database. The complete contents of the Lectin Frontier DataBase have also been imported and curated, and publications have been scoured for data and manually entered. The Leffler Laboratory (Lund, Sweden) provided access to their primary data, which has been processed and partially imported, and initial steps have been taken to include the data generated by the Feizi Laboratory (London, UK). Access to the data is provided through an interactive web-interface, which offers options to locate data either by lectin or carbohydrate structure. Each lectin entry is classified according to the established hierarchical lectin family scheme [106108] and provides a list of experiments conducted with their experimental conditions and technique. The pages of the individual experiments feature the full list of carbohydrates, their recorded affinity, and the possibility to detect standard motifs.

A current limitation in making full use of protein–carbohydrate interaction data is the lack of systematic analysis methods for extracting information, most importantly the deduction of the binding epitope. Recently, the development of a novel algorithm to detect the occurrence of significant motifs in carbohydrate microarray experiments has been reported [109]. The approach entails the selection of 63 commonly occurring carbohydrate motifs (e.g., Lewis-X, terminal beta-GalNAc, etc.) and processing the complex carbohydrates found on the CFG microarray to detect their presence. Subsequent analysis of the occurrences of a motif in the carbohydrate structures of a particular experiment together with the fluorescence intensity measured yields information about the specificity the lectin exhibits towards a subset of the motifs.

Protein databases

Carbohydrates are frequently covalently attached to or interact non-covalently with proteins, and the biosynthesis of carbohydrates involves the action of hundreds of different enzymes. Consequently, databases that contain information on enzymes, biological pathways, glycoproteins, and lectins are of major interest to a glycobiologist (Table 2). The CAZy database [110] provides a comprehensive repository of carbohydrate active enzymes classified by sequence similarity into distinct families. The CAZy database is highly curated, and its family classification system [111] is frequently used in glycobiology. EC numbers are also annotated in the database, which relates structural features to the observed functions of the enzymes. Cross-links to access 3D structures of the enzymes from the Protein Data Bank (PDB) [112] are also available. BRENDA [113] is broader in its scope and offers information on all known enzymes identified by their EC number. A valuable resource for biochemical pathways is the Kyoto Encyclopedia of Genes and Genomes (KEGG) [83] (http://www.genome.jp/kegg/). KEGG PATHWAY is a database that represents molecular interaction networks, including metabolic pathways, regulatory pathways, and molecular complexes. The integration of carbohydrate structures from the KEGG GLYCAN database into the glycan-related pathways makes it possible to relate carbohydrate structures to genes. A variety of other carbohydrate-related tools are available, which makes KEGG an important informatics resource for glycobiology. The KEGG enzymes are also encoded as EC numbers. Other databases containing information on carbohydrate active enzymes are the GlycoGene Database (GGDB) [114], the CFG glycosyltransferase database [84], and the CERMAV glycosyltransferase database [115]. Curated databases of Glycan Binding Proteins (Lectins) are maintained by the CFG (GBP molecules DB), CERMAV (Lectines DB), and JCGG (Lectin Frontier DB) [96]. More detailed information on animal lectins can be found on the website “A genomics resource for animal lectins” (http://www.imperial.ac.uk/research/animallectins/), which also offers valuable data on lectin families and their roles in recognition processes. GlycoEpitope is an integrated database of carbohydrate antigens and antibodies [116] and is available now in the context of the JCGGDBs. The various databases that provide information on enzymes and lectins frequently provide direct links to entries in the Protein Data Bank if available. The O-GlycBase database [117] offers a curated set of experimentally verified O-glycosylated proteins. GlycoProtDB, which is part of JCGGDB, provides information on glycoproteins from C. elegans N2 and mouse tissues [118]. A very valuable information resource on glycoproteins is GlycoSuiteDB [87] cross-linked with UniProt/SWISS-PROT [119].
Table 2

Protein databases that contain carbohydrate related information (see also [32])

Name

Content

URL

CAZy [110]

Carbohydrate active enzymes

http://www.cazy.org/

BRENDA [113]

Enzymes

http://www.brenda-enzymes.org/

JCGGDB [96]

Glyco genes, glycoproteins, lectin affinity, MS data, epitopes

http://jcggdb.jp/

CFG [84]

Glycans, lectins, glycosyltransferases

http://www.functionalglycomics.org/

Glyco3D [115, 120]

3D structures of carbohydrates, glycosyltransferases and lectins

http://www.cermav.cnrs.fr/glyco3d/

O-GlycBase [117]

Curated set of O-and C-glycosylated proteins

http://www.cbs.dtu.dk/databases/OGLYCBASE/

UniProt/SWISS-PROT [121]

Annotated proteins (lectins, glycoproteins, enzymes)

http://www.uniprot.org/

RCSB Protein Data Bank [112]

3D structures of lectins, glycoproteins, enzymes

http://www.rcsb.org/pdb/

Software tools for glycan analysis

Over the years, the increased application of a variety of methods for glycan analysis have led to the development of many software tools that aim to assist in the interpretation of the experimental data generated. High-throughput methods are mainly based on mass spectrometry (MS) and high performance liquid chromatography (HPLC) due to their sensitivity. Many of these methods rely on reference data of known carbohydrates from databases. Nuclear magnetic resonance (NMR) spectroscopy has always played a key role in the de novo determination of carbohydrate structures [122], and some new tools that aim at decreasing the time needed for NMR peak assignment have been recently developed (Table 3).
Table 3

New free software tools for glycan analysis (see also [38])

Name

Content

URL

GlycoWorkbench [136]

Glycan MS spectra annotation

http://www.eurocarbdb.org/applications/ms-tools

http://www.ebi.ac.uk/eurocarb/gwb/home.action

Glyco-Peakfinder [132]

Determination of glycan compositions from their mass signals

http://www.eurocarbdb.org/applications/ms-tools

http://www.glyco-peakfinder.org/

GlycoPep ID [142]

Identifying the peptide moiety of glycopeptides generated using a nonspecific enzyme

http://hexose.chem.ku.edu/predictiontable.php

GlycoMiner [138]

Glycopeptide composition analysis

http://www.chemres.hu/ms/glycominer/index.php

AutoGU [86]

HPLC analysis

 

ProspectND

NMR spectra processing

http://www.eurocarbdb.org/applications/nmr-tools

CCPN Tools [148]

Annotation of NMR spectra

http://www.ccpn.ac.uk/

CASPER [146]

Structure determination of oligosaccharides and regular polysaccharides

http://www.eurocarbdb.org/applications/nmr-tools

http://relax.organ.su.se:8123/eurocarb/casper.action

Automatic interpretation of MS data remains a challenging task due to the technique only being able to report masses of the fragments observed. This information can be used to calculate a set of potential compositions the carbohydrate could have, which then give rise to a set of potential carbohydrates that satisfy the composition constraints. One of the first tools to deduce potential carbohydrate compositions from MS data is the GlycoMod online service [123]. The number of compositions matching a certain mass value scales exponentially with the number of different monomers; therefore, taxonomic and biosynthetic information have to be incorporated in the assignment process so that smart choices can be made based upon a given composition. The Cartoonist tool [124127] uses a set of archetypal structures in combination with a set of rules for their potential modification to generate all types of glycans that could be possibly synthesized by mammalian cells. Archetypes and rules have been compiled by a group of experts and represent the current knowledge on biosynthetic pathways in mammalian organisms. The number of matching structures is thus greatly reduced by avoiding implausible molecules. Cartoonist has been extended over the years, and now uses a well-tested library of structures, scoring routines to establish confidence scores in an annotation, and knowledge of biochemical pathways to aid the interpretation of spectra. Cartoonist is used as an analysis tool for the glycan profiling service of the CFG.

De novo sequencing tools based on MSn fragmentation data are STAT [128], OSCAR [129] StrOligo [130], and GLYCH [131]. Glyco-Peakfinder [132] is a new web-service for the de novo determination of the composition of glycan-derived MS signals independent of the source of spectral data. Library-based sequencing methods—similar to those applied in proteomics—are also applied in glycomics, where experimental peaks of MS2 spectra are compared with online calculated theoretical fragments from user-definable carbohydrate sequences deposited in databases (glyco-fragment mass fingerprinting). Such an approach is supported by the GlycoSearchMS service [133], where Glyco-Fragment [134] is used to calculate fragment-libraries of all carbohydrates contained in GLYCOSCIENCES.de. GlycosidIQ™ [135] is a similar mass fingerprinting tool developed for interpretation of oligosaccharide mass spectrometric fragmentation based on matching experimental data with theoretically fragmented oligosaccharides generated from the database GlycoSuiteDB [87]. However, the success of such an approach depends on the comprehensiveness of experimentally determined glycan structures included in the database. GlycoWorkbench [136], one of the most recent tools for the computer-assisted annotation of mass spectra of glycans, has been developed as an open source project in the context of the EUROCarbDB project. The main task of GlycoWorkbench is to evaluate a set of structures proposed by the user by matching the corresponding theoretical list of fragment masses against the list of peaks derived from the spectrum. For annotation, GlycoWorkbench uses a database of carbohydrate structures derived from GLYCOSCIENCES.de, CarbBank, and CFG glycan. The tool provides an easy to use graphical interface, a comprehensive and increasing set of structural constituents, an exhaustive collection of fragmentation types, and a broad list of annotation options. Mass spectra annotated with GlycoWorkbench can be uploaded into the EUROCarbDB MS database. GlyQuest [137] and GlycoMiner [138] are new tools that support high-throughput composition and primary structure determination of N-glycans attached to peptides, based on CID (collision induced dissociation) MS/MS (tandem mass spectrometric data). Also new is SysBioWare [139], a general software platform for carbohydrate assignment based on MS data. More specialized applications have been reported for the analysis of glycosaminoglycans [140], in silico fragmentation of peptides linked to N-glycans [141], and identifiying the peptide moiety of sulfated or sialylated carbohydrates [142]. Of further interest for bioinformatic developers might be the OpenMS [143] initiative that provides an open source framework for mass spectrometry and TOPPView [144], an open source viewer for MS data.

The development of robust high-performance liquid chromatography (HPLC) technologies continues to improve the detailed analysis and sequencing of glycan structures released from glycoproteins. In the context of the EUROCarbDB project, an analytical tool (autoGU) [86] was developed to assist in the interpretation, assignment, and annotation of HPLC-glycan profiles. AutoGU assigns provisional structures to each integrated HPLC peak and, when used in combination with exoglycosidase digestions, progressively assigns each structure automatically based on the footprint data. The software is assisted by GlycoBase, a relational database originally developed at the Oxford Glycobiology Institute, which contains the HPLC elution positions for over 350 2-AB labelled N-glycan structures together with digestion pathways. The system is suitable for automated analysis of N-linked sugars released from glycoproteins and allows detection of the carbohydrates at femtomolar concentrations [18].

ProSpectND is an advanced integrated NMR data processing and inspection tool, originally developed at the University of Utrecht and refined during the EUROCarbDB project. It allows batch processing of spectra simulations and automated graphics generation. CASPER is a tool for calculating chemical shifts of oligo- and polysaccharides as well as N- and O-glycans. The tool already has a long history [145, 146] and has recently received a major upgrade. CASPER can also be used for determining the primary structures of carbohydrates by simulating spectra of a set of possible structures and comparing them with the supplied experimental data to find the best match. The Collaborative Computing Project for the NMR community (CCPN) has developed a powerful data model for NMR experiments [147] and an assignment package for NMR spectra of proteins and peptides (CcpNmr Analysis) [148]. As a result of an intensive EUROCarbDB/CCPN collaboration, support for (branched) carbohydrates was added to the CCPN data model and CcpNmr Analysis software. Additionally, CASPER can be used to automatically assign NMR signals to carbohydrate atoms in connection with CcpNmr Analysis. CCPN project files can be directly uploaded into the EUROCarbDB NMR database.

Prediction and statistical analysis of glycosylation sites

Glycosylation is the most common post-translational modification of proteins [1]. Initial analyses yielded a consensus sequence motif for N-type glycosylation, Asn-X-Ser/Thr, with any amino acid at X except proline. Every N-type glycosylation site adheres to this motif, but its sole presence in the amino acid sequence is only a necessity and not sufficient to predict the presence of a glycan. The situation becomes further complicated when one considers mucin-type or other types of O-glycosylation where the glycan is usually attached to a serine or threonine. Research in this area over the last 10 years has resulted in a series of approaches for the prediction of glycosylation sites (Table 4). All strategies for the prediction of glycosylation sites are of a statistical nature: NetCGlyc [149], NetNGlyc [150], NetOGlyc [151], and YinOYang [150] all use neural networks for the prediction of glycosylation sites; big-Pi [152] employs scoring functions based on amino acid properties; GPI-SOM [153] uses a Kohonen map; CKSAAP_OGlySite [154], and EnsembleGly [155] use a Support Vector Machine based approach; and GPP [156], the currently best performing predication tool, uses a hybrid combinatorial and statistical learning approach based on random forests. Training datasets for the statistical learning approaches are usually derived from the PDB or O-GLYCBASE [117].
Table 4

Tools for prediction and analysis of glycosylation sites

Name

Description

URL

Big-Pi [152]

GPI-anchors

http://mendel.imp.ac.at/sat/gpi/gpi_server.html

GPI-SOM [153]

GPI-anchors

http://gpi.unibe.ch/

NetCGlyc [149]

C-mannosylation

http://www.cbs.dtu.dk/services/NetCGlyc/

NetNGlyc [150]

N-glycosylation

http://www.cbs.dtu.dk/services/NetNGlyc/

NetOGlyc [151]

O-glycosylation

http://www.cbs.dtu.dk/services/NetOGlyc/

YinOYang [150]

O-beta-GlcNAc-ylation

http://www.cbs.dtu.dk/services/YinOYang/

EnsembleGly [155]

O-, N- and C-glycosylation

http://turing.cs.iastate.edu/EnsembleGly/

CKSAAP_OGlySite [154]

Mucin-type O-glycosylation

http://bioinformatics.cau.edu.cn/zzd_lab/CKSAAP_OGlySite/

GPP [156]

O- and N-glycosylation

http://comp.chem.nottingham.ac.uk/glyco/

GlySeq [159]

Statistical analysis of glycosylation sites based on sequence

http://www.dkfz.de/spec/glycosciences.de/tools/glyseq/

GlyVicinity [159]

Statistical analysis of glycosylation sites based on 3D structures

http://www.dkfz.de/spec/glycosciences.de/tools/glyvicinity/

The mechanism of acceptor site selection for the covalent attachment of carbohydrates by a series of glycosyl transferases in the case of C-type and O-type glycosylation, and by the oligosaccharyltransferase (OST) for N-type glycosylation, is still not completely understood. This has led to studies on the statistical and structural properties of amino acids in the neighborhood of glycosylation sites. Conformational and statistical properties of N-type glycosylation sites were analyzed based on glycosylated proteins found in the PDB [157]. Statistical properties of O-type glycosylation were analyzed based on entries of the O-GLYCBASE [158]. The GlySeq and GlyVicinity online services [159] allow for interactive exploration of the statistical and conformational properties of glycosylation sites and their surroundings.

Carbohydrate 3D structures and molecular modeling

Complex carbohydrates represent a particularly challenging class of molecules in terms of describing their three-dimensional (3D) structure. Due to their inherent flexibility, these molecules very often exist in solution as an ensemble of conformations rather than as a single well-defined structure. Traditionally, NMR methods, especially Nuclear Overhauser Effect (NOE) measurements, have been widely used to study oligosaccharide conformation in solution [160, 161]. Unfortunately, many oligosaccharide NOEs cannot be resolved or are difficult to assign. Additionally, there are often too few inter-residue NOEs to make an unambiguous 3D structure determination possible. In general, the interpretation of structural experimental data frequently needs to be supported by molecular modeling methods [162, 163]. One of the main aims of computer modeling of carbohydrates is to generate reasonable 3D models that can be used to rationalize experimentally derived observations. Conformational analysis by computational methods consequently plays a key role in the determination of 3D structures of complex carbohydrates.

In recent years, a variety of modeling methods have been applied to the conformational analysis of carbohydrates [164]. Of these, the calculation of conformational maps for disaccharides using systematic search methods, and molecular dynamics (MD) simulations of oligosaccharides in explicit solvent, are by far the most popular methods in modeling of carbohydrate 3D structures [165168]. Although quantum mechanics (ab initio) methods are used for modeling of carbohydrate conformation, these methods are still computationally too demanding to be used routinely to study or predict the 3D structure of complex carbohydrates. Quantum mechanics methods are mainly used to study chemical reactions [169171], to calculate force constants and atom charges to be used as force field parameters [172, 173], or for the conformational analysis of smaller carbohydrates [174176].

In general, molecular modeling methods are applied in glycobiology at various levels of required expertise and computer equipment. Simple model building using a molecular builder will already give one valuable insight: complex carbohydrates look in 3D very different from the impression one gets by looking at chemical drawings or cartoon representations. Also, the simple overlay in 3D of a new carbohydrate ligand onto an existing one in a crystal structure can give valuable first insight into a possible binding mode. However, one has to be aware of the limitations of such basic modeling approaches: molecular builders generate one reasonable conformation out of many; and manually overlaying two carbohydrates in a binding site is a strong bias towards one predefined binding mode and alternative binding modes are unlikely to be discovered. The other extremes would be to perform a complete conformational analysis of a complex carbohydrate based on extensive MD simulations in explicit solvent, which may take weeks of calculation time, and GBytes of simulation data need to be analyzed afterwards, or to screen the complete protein surface for carbohydrate binding sites using extensive dockings based on genetic search algorithms. Recently, even high-level Car-Parinello-based ab initio MD simulations combined with metadynamics simulation have been applied to carbohydrates [170, 177, 178]. The question which modeling method would work best for solving a specific scientific problem is not always straightforward to answer. For example, if exploring the accessible conformational space of a carbohydrate is of major interest then the MD simulation could also be performed in gas phase at higher temperatures instead of running an extensive MD simulation in explicit solvent at room temperature. However, for some systems, the use of explicit solvent is necessary,while for others, one would reach the same conclusions based on gas phase simulations, but with much less computational cost. As is the case with ‘experimental’ methods, ‘experience’ is the key to successful application of molecular modeling methods in most cases.

Since the beginning of the 1990s, more and more crystal structures have been reported where carbohydrates are covalently attached to a (glyco) protein or constitute the ligand in a protein–carbohydrate complex [179, 180]. These experimentally determined 3D structures are freely accessible from the Protein Data Bank (PDB) [112]. So, over recent years, the PDB has become a very valuable resource for obtaining conformational properties of carbohydrates [181]. However, it has to be kept in mind that crystals are often grown under non-physiological conditions, and flexible molecules like carbohydrates may change conformation due to strong forces induced by crystallographic packing.

A variety of reviews and book chapters on conformational analysis of carbohydrates have been published and are recommended for further reading [46, 165, 166, 182].

Databases containing 3D structures of carbohydrates

The two major databases where experimentally determined carbohydrate structures are stored are the Cambridge Structural Database (CSD, http://www.ccdc.cam.ac.uk/products/csd/) and the Protein Data Bank (PDB, http://www.rcsb.org/pdb/). Many crystal structures of small oligosaccharides [183] are also accessible through the Glyco3D web interface (http://www.cermav.cnrs.fr/glyco3d/).

The PDB [112] currently contains more than 60,000 3D structures of biomolecules, of which about 4,000 contain carbohydrates [184]. Most of the carbohydrates in the PDB are either connected covalently to a (glyco)protein, or the carbohydrate forms a complex with a lectin, enzyme, or antibody. Isolated carbohydrates are only rarely found in the PDB. When looking at the carbohydrate structures of a PDB entry, one has to keep in mind that frequently only fragments of the original carbohydrates may be resolved (Fig. 8). Additionally, the 3D structures of the carbohydrates in the PDB do not always meet high quality standards; therefore, one has to look at the structures with care. It has been recognized that, in order to improve the quality of the 3D structures contained in the PDB, theoretical validation procedures for carbohydrates have to be established [184186]. Despite these limitations, the PDB is an important source of information on carbohydrate 3D structures [187]. Unfortunately, due to the lack of a consistently used nomenclature for carbohydrates in PDB files, it is difficult to find the entries of interest. To overcome this problem, the GLYCOSCIENCES.de web portal [187] and the Glycoconjugate Data Bank: Structures (http://www.glycostructures.jp) [188] offer convenient ways to search for carbohydrate structures in the PDB.
https://static-content.springer.com/image/art%3A10.1007%2Fs00018-010-0352-4/MediaObjects/18_2010_352_Fig8_HTML.jpg
Fig. 8

N-Glycosylation site of Phanerochaete chrysosporium Laminarinase 16A (pdb code 2W52 [189]). Although the resolution of the X-ray crystal structure is rather high (1.56 Å), parts of the N-glycan are not visible in the electron density due to the high flexibility of the branch linked to position 3 of the core mannose

Statistical analysis of structural parameters of the carbohydrates present in the PDB entries can be performed with the tools GlySeq, GlyVicinity, GlyTorsion, and carp (http://www.glycosciences.de/tools) [159]. GlySeq checks the type of amino acids that are in the sequence neighborhood of N- and O-glycosylation sites, GlyVicinity performs an analysis of the population of amino acids in the spatial vicinity of carbohydrate residues, and GlyTorsion provides access to the torsion angles of the glycosidic linkages. Using the carp tool (CArbohydrate Ramachandran Plot), these torsions can be compared to theoretical Ramachandran-type conformational maps stored in GlycoMapsDB [185]. This service can be used by crystallographers to cross-check or validate the carbohydrate 3D structures similar to the Ramachandran plot analysis that is routinely used to evaluate the backbone torsions of protein structures.

Molecular modeling of carbohydrates over the Internet

Easy-to-use and freely available Web-based tools [32, 190] are available to generate an initial model of a carbohydrate 3D structure. SWEET-II [191] is a frequently used carbohydrate 3D builder that is available on the GLYCOSCIENCES.de [82] website, which also provides the GlyProt [192] tool for in silico glycosylation of proteins derived from the PDB. A very nice builder for carbohydrates and glycoproteins is also available at the GLYCAM website (http://www.glycam.com). The first molecular builder for carbohydrates was probably POLYS [193] and the latest development in this field is FSPS (fast sugar structure prediction software) [194]. However, it is unclear whether POLYS or FSPS are available to the scientific community via a website or for download. A web-portal to perform MD simulations of carbohydrates over the Internet [195] was recently shut down because the glycoinformatics group that maintained the service was closed [42].

Molecular dynamics simulation

Despite the significant limitations that still exist, the use of molecular dynamics (MD) simulations has turned out to be an excellent methodology to study the conformational properties of carbohydrates and other biomolecules [163, 196199]. Although quantum mechanics-based MD simulations have recently become feasible, most applications of MD are still based on force fields. The development of carbohydrate force fields in itself is a challenging task and is still in progress [172, 173, 200]. Because carbohydrates are polar molecules, the proper treatment of atom charges is likely to be of significant importance particularly for modeling of intermolecular interactions [201]. The discussion about including extra terms for (exo) anomeric effects into force fields has a long tradition in carbohydrate modeling [202205]. The solvent model used for the MD simulation also has a significant effect on the results [206]. In order to make the outcome of an MD simulation more reliable, the theoretical results should always be compared to experimental results if possible. It has to be kept in mind that disagreement between computational and experimental results does not necessarily mean that the force field used is inappropriate for the simulation of carbohydrate structures. It can also mean that other simulation parameters used are not appropriate (e.g., simulation time, solvent model) or possibly that there is a significant error in the experimental results themselves. However, experimental results are very important for validating the quality of theoretical calculations. In recent years, MD simulations have been used to study conformations of complex carbohydrates [206209], glycolipids [200, 210, 211], glycopeptides [212], glycoproteins [213, 214], protein–carbohydrate complexes [215217], protein–glycopeptide interaction [218], carbohydrate–ion interaction [219], and carbohydrate–water interaction [220].

The MD simulation of a complex oligosaccharide or glycoprotein in a solvent box is computationally very expensive, and CPU time of many weeks or months may be required in order to simulate a timescale of only a few nanoseconds [221] (Fig. 9). The timescales of most of the published MD simulations involving carbohydrates are in the range of up to 50 ns. However, in order to achieve convergence for the rotamer populations of the exocylic C–C torsions, the length of an MD simulation should be longer than 100 ns [206, 222]. Although water models like TIP5P are a better approximation of water, in most MD simulations much simpler water models, like SPC or TIP3P [223, 224], are used because of calculation speed and because most force field parameters have been tailored to these simple models. The trajectory files of an MD simulation are typically many GBytes in size. With the availability of supercomputers terabytes of MD data can be produced easily within a short time. The current bottleneck in the application of MD simulations is therefore hard disk space and the requirement to analyze and interpret very quickly the huge amount of data produced. Analysis tools that are part of the MD software distributions [225227], and which have been mainly developed for the analysis of proteins, are frequently used for the analysis of MD trajectories of carbohydrates. However, because of the limitations of the available tools, glycoscientists tend to develop their own ‘in-house’ analysis software. Recently, ‘Conformational Analysis Tools’ (CAT) [228], a novel software for the analysis of MD trajectories, has been made publicly available. CAT is optimized for the efficient conformational analysis of carbohydrates, glycoproteins, and protein–carbohydrate complexes. In summary, MD simulations provide valuable additional information on the conformational dynamics of the system investigated, which is frequently not available from experimental methods. However, although performing simple MD simulations is straightforward in most cases, the correct setup and interpretation of the results requires expert knowledge, and the limitations of the method have to be taken into account.
https://static-content.springer.com/image/art%3A10.1007%2Fs00018-010-0352-4/MediaObjects/18_2010_352_Fig9_HTML.jpg
Fig. 9

MD simulation of SIV gp120 glycoprotein (M. Frank, unpublished). Complete N-glycans were modeled at 13 glycosylation sites based on the X-ray structure (pdb code 2BF1 [229]). a Solvation shell of a selected N-glycan on the protein surface. b A significant surface area of the protein is shielded by the N-glycans. The molecular system has more than 100,000 atoms (4,832 protein atoms, 3,432 carbohydrate atoms, 30,665 water molecules, 4 chloride ions). Water molecules are not shown for clarity

Modeling protein–carbohydrate interaction

One of the major challenges in molecular modeling at the moment is the development of efficient and accurate methods to estimate the binding affinity of protein–carbohydrate complexes [46]. The application of docking methods to study protein–carbohydrate interaction has lately significantly increased [230236]. Typically, a flexible ligand is docked to a rigid receptor; however, examples are reported where receptor flexibility has been included in the docking protocol [237239]. Next to an efficient searching algorithm, the availability of a robust scoring function is critical for the success of docking [240]. Bridging water molecules and CH–π interactions [241243] play a major role in the interaction of a carbohydrate and a protein. However, for various reasons, it is difficult to include these factors into the scoring functions of available standard docking software (e.g. AutoDOCK) [244]. Therefore, despite the many successful applications of docking methods, there are still significant problems with respect to the correct prediction of carbohydrate binding sites and relative affinities in some cases.

If a 3D structure of a protein–carbohydrate complex is available (e.g. an X-ray structure), MD simulations can be performed to study the local interactions (hydrogen bonding, hydrophobic interactions, water bridges) in more detail [245, 246] or to calculate free energy [247249] and entropy changes upon binding [250]. Since polar OH groups of carbohydrates quite frequently like to bind in areas where water molecules are also found on the protein surface, an investigation of the water binding sites is of particular interest [251253].

Conclusion

From the bioinformatics point of view, carbohydrates are a particularly challenging class of biomolecules. Like nucleic acids or proteins, they are assembled from a set of molecular building blocks; however, due to multiple linkage types and sites, even linear carbohydrates are much more complex. Additionally, complex carbohydrates frequently contain one or more branches, which renders most of the sequence algorithms developed for genes not applicable to carbohydrates, and so more complicated tree-based algorithms have to be developed and applied. As a result, established bioinformatics groups seem to neglect carbohydrates to a large extent, and only a few glycoinformatic pioneers face the challenge to develop computer algorithms for carbohydrate sequences.

Significant improvements in glycan analysis and the application of carbohydrate microarrays in glycomics research have led to a significant increase in the amount of experimental data generated. Unfortunately, because of the lack of an established glycoinformatics infrastructure and standards in the field, each research group or consortium has developed their own storage formats, databases, and tools. This renders data integration and exchange very difficult. In recent years, it has become obvious that this situation needs to be changed in the future, and centrally integrated, curated, and comprehensive databases are required for glycomics, similar to proteomics and genomics [27, 38, 254]. Despite this insight, it is very difficult to establish a global glycoinformatics infrastructure at the moment due to the lack of funding and leadership. This is particularly unfortunate because, in recent years, the field has made significant progress: a standard glycan sequence format (GlycoCT) has been developed; and GlycomeDB integrates globally all carbohydrate structure databases and makes the structures searchable for scientists through one central web-interface. In the context of the EUROCarbDB project, standards, tools, and databases have been developed to store carbohydrate structures, analytical data, biological context, and literature references. Recently, a database prototype (GlyAffinity) that aims at integrating all types of protein–carbohydrate interaction data has been developed; the bioinformatic cores of the Consortium for Functional Glycomics (CFG) and the Japan Consortium for Glycobiology and Glycotechnology (JCGG) are making available a vast amount of experimental data; and the EuroGlycoArrays Consortium has just entered the field and will provide more experimental data for the community. Last but not least, in recent years, people working in the field have really started to talk to each other, which is an important catalyst for establishing bioinformatics standards and fueling data integration.

Over the years, established methods in structural glycobiology, like X-ray crystallography, NMR, and molecular modeling, have provided valuable insights into the three-dimensional structures of carbohydrates, glycoproteins and protein–carbohydrate complexes. This has further improved our understanding of the functions of glycans, and may help to design better enzymes, drugs, or vaccines [10, 255258]. It has been realized within the PDB consortium [112] that the data representation and validation of carbohydrates in the PDB needs to be revised. Working groups have been established recently to develop a new format for representing carbohydrates in the PDB, and to recommend new computational tools to be developed [259], as well as to develop standards for glycomics databases and experimental reporting [260].

For a long time glycobiology has been the cinderella field in life sciences: an area that involves much work but, does not get to show off at the ball with her cousins, the genomes and proteins” [261]. This has changed dramatically over the last 10 years. Large collections of new glycomics data are available that are ready to be integrated into the large data collections of proteomics and genomics. Bioinformatics standards for glycomics have been established and, in the context of the EUROCarbDB project, strategies and concepts for data sharing have been worked out and initial discussions with bioinformatic groups from the proteomics field have taken place. It has been realized that the topics currently discussed in proteomics on data sharing [262] are very similar to the aims of EUROCarbDB. Therefore, the next steps will be to establish a closer collaboration with bioinformatic groups in proteomics and genomics which will hopefully result in the long-term establishment of glycoinformatic concepts at the European Bioinformatics Institute (EMBL-EBI) or the National Center for Biotechnology Information (NCBI) and a bioinformatics center in Japan. In conclusion, bioinformatics for glycomics has evolved beautifully over recent years and is ready to be invited to show up at the ball with proteomics and genomics in order to waltz together.

Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Copyright information

© The Author(s) 2010

Authors and Affiliations

  1. 1.Molecular Structure Analysis Core Facility-W160Deutsches Krebsforschungszentrum (German Cancer Research Centre)HeidelbergGermany