Introduction

Recent advances in proteomics and related computational science, combined with advanced cell biology technologies, are uncovering molecular mechanisms for various biological processes. In clinical science, much attention is focused on understanding the pathophysiology of human disease, especially neoplastic cell transformation, to develop new therapeutic targets and to discover biomarkers useful for early diagnosis, drug development, and toxicology. The US Food and Drug Administration has approved 20 molecular biomarkers for clinical use to detect and monitor human diseases, including cancer (1). Most of these markers were identified by hybridoma screening, where the antibodies were generated by immunization with target cells, followed by antigen identification (2); however, proteomics is an alternative approach for biomarker discovery that would permit quantitative analysis of protein changes associated with disease development, such as tumor growth, on a “genome-wide” scale.

Numerous studies have aimed to discover new cancer biomarkers, mostly through differential protein analysis between normal and cancerous tissue samples. For instance, two-dimensional gel electrophoresis analyses of various tumor samples taken during surgery or captured by laser microdissection have provided candidate cancer biomarkers (3–5). Likewise, proteomic analyses of bodily fluids, such as plasma and urine, have been used to identify tissue-specific diagnostic biomarkers for early-stage cancer (1, 6–8). Human plasma has been the primary focus because its components originate from a variety of tissues/cells. However, the difficulty in analyzing the plasma proteome is widely recognized. Notably, there is an enormous dynamic range between the highest and lowest component concentrations: a single component, albumin, accounts for approximately half of the total protein mass in plasma (55 mg ml−1), and roughly ten major proteins account for 90% of the total protein mass. In contrast, trace plasma components such as the cytokine interleukin-6 are present at 1–5 pg ml−1; the difference in concentration between albumin and interleukin-6 is thus about \(10^{10}\) (9). Because this dynamic range is far beyond the analytical range (\(10^{3}\)–\(10^{4}\)) of current proteomic technology, new and highly sensitive proteomic methods for enrichment of trace biomarker candidates are key to the search for new diagnostic biomarkers in plasma.
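
The size of this gap can be checked with simple arithmetic. The sketch below assumes that the 55 mg ml−1 figure refers to total plasma protein and that albumin is exactly half of it; both readings are assumptions made for illustration, and all names in the code are hypothetical.

```python
import math

# Assumption: 55 mg/ml is total plasma protein and albumin is half of it.
TOTAL_PROTEIN_G_PER_ML = 55e-3
ALBUMIN_G_PER_ML = 0.5 * TOTAL_PROTEIN_G_PER_ML

# Interleukin-6 range quoted in the text: 1-5 pg/ml.
IL6_RANGE_G_PER_ML = (1e-12, 5e-12)

# Analytical dynamic range quoted in the text: 10^3 to 10^4.
ANALYTICAL_ORDERS = (3, 4)


def orders_of_magnitude(high: float, low: float) -> float:
    """Return log10 of the concentration ratio high/low."""
    return math.log10(high / low)


plasma_orders = [orders_of_magnitude(ALBUMIN_G_PER_ML, c) for c in IL6_RANGE_G_PER_ML]

# Shortfall between the smallest plasma span and the best-case analytical range.
shortfall = min(plasma_orders) - max(ANALYTICAL_ORDERS)

print(f"albumin/IL-6 spans {min(plasma_orders):.1f} to {max(plasma_orders):.1f} orders of magnitude")
print(f"at least {shortfall:.1f} orders lie beyond a 10^4 analytical range")
```

Under these assumptions the albumin/interleukin-6 ratio falls between roughly \(10^{9.7}\) and \(10^{10.4}\), so even the most favorable case leaves more than five orders of magnitude that must be covered by depletion or enrichment before MS analysis.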

Glycoproteins are potential diagnostic biomarkers because most secretory proteins are glycosylated, and their glycan structures frequently and drastically change during tumorigenesis as well as in normal cell differentiation. Glycans are the first cellular components encountered by approaching cells, pathogens, antibodies, or other molecules. In addition, they are often used as specific cell biomarkers at different stages of differentiation. Different cell types express different glycan signatures. These two fundamental characteristics of glycoproteins, i.e., common posttranslational modification in secretory proteins and lineage-specific signatures of glycans, make them ideal candidates for cancer biomarkers. Most cancer-related biomarkers available to date, such as CA19.9, prostate-specific antigen, CEA, and α-fetoprotein, are glycoproteins or glycoconjugates (1). In this review, we summarize mass spectrometry (MS)-based glycoproteomic technologies for cancer biomarker discovery, with a special focus on pre-proteomic analysis enrichment of glycopeptides from complex biological mixtures and their assignments and structural analysis by MS.

Enrichment of Glycopeptides for MS-Based Proteomics

Although there are reports of glycoproteome analyses based on electrophoresis techniques (10–12), the MS-based “shotgun” approach has advantages in speed, sensitivity, and automation. It is particularly suitable for glycoproteomics analysis because this method allows for the identification of glycoproteins regardless of their subcellular localization (many glycoproteins are integrated into membranes) or structure (most glycoproteins are heterogeneous in charge and molecular size and thereby give rise to multiple or smear spots in two-dimensional polyacrylamide gel electrophoresis). Thus, most of the large-scale glycoproteome analyses have used the shotgun approach; however, the key goal is to capture minor glycopeptides efficiently from samples containing large numbers of non-glycosylated peptides produced by proteolytic digestion of complex protein mixtures such as crude cellular extracts, tissues, or plasma.

Affinity Chromatography for Capturing Glycopeptides

Lectins

More than 2,000 lectins have been detected from various sources (13–15), and about 200 of those are characterized in amino acid sequence, hemagglutinating activity, tertiary structure, etc. (http://proline.physics.iisc.ernet.in/cgi-bin/cancerdb/input.cgi/; and http://nscdb.bic.physics.iisc.ernet.in/). Although the binding specificities and kinetic parameters remain largely unknown, they are useful tools to capture, concentrate, and classify glycoconjugates, including glycoproteins and glycopeptides (Fig. 1). Because non-reducing ends of naturally occurring glycans are limited to mannose, galactose, N-acetylglucosamine, sialic acids, and rarely N-acetylgalactosamine, a number of lectins with binding specificities to these glycans can be used to capture subsets of the glycoproteome (Table 1). For a more comprehensive or systematic collection of glycopeptides from complex biological mixtures, multiple lectins with distinct binding specificities are used in combination or in series (16–20). For instance, we applied three lectin columns bound with concanavalin A (ConA), wheat germ agglutinin (WGA), or worm galectin 6 for a comprehensive analysis of N-linked glycoproteins in Caenorhabditis elegans and identified 1,465 N-glycosylated sites on 829 unique proteins, including 224 putative secretory and 432 membrane-bound N-glycoproteins with single or multiple transmembrane segments (21). Interestingly, the glycosylation site was used as a landmark for analysis of the topology of membrane-bound N-glycoproteins because the initial protein glycosylation takes place only in the endoplasmic reticulum lumen and the glycosylated segments do not cross the membrane bilayer. Based on a topological analysis of 257 N-glycosylated proteins containing putative single transmembrane segments, we suggested an atypical non-cotranslational translocation mechanism for integral membrane proteins (21).
We also note that many N-glycoproteins identified in this study have mammalian counterparts that are classified as disease-related genes/proteins (data not shown). Yang et al. (22) performed a comparative glycoproteome analysis of sera from breast cancer patients and normal controls using a mixed-bed column of Jacalin-, ConA-, and WGA-agarose. The authors identified 813 glycoproteins in the serum samples, including low-abundance components such as neuropilin-1 and pregnancy zone protein, and found a difference in the expression of a number of proteins associated with lipid transport and cell growth. Likewise, Drake et al. (23) analyzed human plasma glycoproteins captured by various lectins and found a difference between sera from subjects with benign prostatic hyperplasia and prostate cancer or between sera from hepatocellular carcinoma patients and control subjects. Thus, MS-based analysis of lectin-captured glycopeptides allows for large-scale glycoproteome analysis and identifies cancer biomarker candidates.

Fig. 1
Methods to capture glycopeptides. (1) Lectin/antibody-mediated affinity chromatography, (2) hydrazide chemistry-based covalent capturing, (3) BEMAD and subsequent disulfide formation, (4, 5) metabolic incorporation of azidosugar and ligation of biotin/FLAG tags via (4) Staudinger ligation or (5) Huisgen cycloaddition, and (6) enzymatic addition of a keto-derivative of Gal onto O-GlcNAc and subsequent incorporation of biotin tag. Black hexagon sugar chain, gray hexagon Gal, white hexagon O-GlcNAc

Table 1 Lectins used for glycopeptide capture and their binding specificity for glycans

To assist lectin-based analyses of glycoproteins, an automated high-throughput method, based on frontal affinity chromatography (24), was recently developed (25) and applied to the comprehensive interaction analysis between ∼100 lectins and ∼100 glycans (the data for 50 typical lectins will become available soon at the Japan Consortium for Glycobiology and Glycotechnology Database, http://www.jcgg.jp/E/index.html). Though certain lectins have a broad binding specificity to glycans and the binding capacities of lectins are affected by the tertiary structure of glycoconjugates (26), a large-scale dataset of the lectin/glycan interactions will help to select lectins for glycoproteome analysis.

Antibodies Against Glycans

Antibodies against glycans have been used to capture glycoproteins having a static glycan structure (27–29). Lewis X is a prominent member of the Lewis blood group antigen family that can be found on glycoproteins, glycolipids, and proteoglycans. Its antigenicity is evident from the fact that many research groups have generated monoclonal antibodies against this trisaccharide structure while studying developmental processes or cancer. This type of antibody is applicable for glycoproteomics analysis; however, the application of antibody-mediated glycopeptide capture is limited because glycans, especially N-glycans, are generally poor antigens owing to their structural conservation among immunized animal species.

Glycoprotein Receptors

Glycoprotein receptors are alternative tools to collect specific glycoprotein subsets. For instance, Sleat et al. used mannose-6-phosphate (M6P) receptors to capture N-glycoproteins and identified many known, as well as unknown, M6P motif-containing glycoproteins in human brain lysosomes (30), plasma (31), and urine (32) samples. The authors suggested that the method might be useful in the search for biomarkers of lysosomal disorders.

Chemical Coupling to Capture Glycopeptides

The method developed by Zhang et al. (33) captures glycopeptides on a solid support by chemical coupling between cis-diol groups of the glycan and hydrazide on the support. N-linked glycopeptides are then released from the resin by digestion with peptide:N-glycanase (PNGase; Fig. 1). Unlike lectin affinity chromatography, the method captures N-linked glycopeptides regardless of the glycan structure. Modifications of this original protocol include incorporation of a stable isotope tag into peptides attached on the solid support via the succinimidation of amino groups for quantitative analysis (33) or introduction of superparamagnetic silica particles with a hydrazide group to facilitate high-throughput analysis (34). The method was applied to various biological samples including human plasma (35, 36), plasma of healthy and trauma patients (37), platelets (38), saliva (39), prostate cancer epithelial cells (33), and a microsomal fraction of cisplatin-resistant ovarian cancer cells (40).

On the other hand, a unique technology based on β-elimination followed by Michael addition with dithiothreitol (DTT), termed BEMAD, captures O-linked glycopeptides (41) (Fig. 1). The introduced thiol group is attached to a thiol-containing solid support, such as thiol-Sepharose, through a disulfide bond. After removing non-O-glycosylated peptides by washing the support with appropriate buffer, the captured formerly O-glycosylated peptides are recovered by elution with DTT. Because the β-elimination reaction also occurs at O-phosphorylated or O-sulfated Ser/Thr residues in the sample mixture, the method should be coupled with an enrichment of O-glycosylated peptides by, for instance, lectin affinity chromatography on a WGA column. By using this method in combination with MS3- and electron-transfer dissociation (ETD)-MS techniques, Vosseller et al. (42) identified 65 O-glycosylated peptides in 18 proteins from mouse brain postsynaptic density preparations and predicted the substrate specificity of mammalian O-GlcNAc transferase.

Carbohydrate-Tags via Metabolic or Chemo-enzymatic Labeling

Two methods have recently been developed to introduce a specific affinity tag to capture glycoproteins and glycopeptides from complex mixtures (Fig. 1). One method, called the “tagging-via-substrate” approach, utilizes peracetylated azidomonosaccharides or thiol derivatives for metabolic incorporation of the artificial sugar moiety into glycans synthesized in cultured cells or in animals such as mice (43–50). For instance, administration of N-α-azidoacetylmannosamine into culture media results in incorporation of its metabolite, N-α-azidoacetyl sialic acid, into glycoconjugates, including N-glycoproteins. The azide group is then reacted with phosphine compounds carrying a biotin or FLAG peptide tag through the Staudinger ligation reaction (43) or with alkyne compounds carrying similar tags through Huisgen [3+2] cycloaddition to label glycoconjugates (45). The tagged glycoproteins are affinity-captured with avidin or an antibody against FLAG. These new techniques have identified many O-glycosylated proteins in mammalian cells (43, 47, 48). However, the method needs improvement to identify N-glycoproteins because of the poor metabolic tolerance for incorporation of the azide-bearing artificial sugar moiety into N-glycans.

Another method, the “chemo-enzymatic” approach developed by Khidekel et al. (51, 52), utilizes a genetically engineered galactosyltransferase to incorporate ketone analogs of galactose into cellular O-glycosylated proteins, followed by incorporation of a biotin label through coupling with aminoxy-biotin. The method identified 25 O-glycosylated proteins in rat brain.

LC-Based Enrichment of Glycopeptides

Hydrophilic Interaction LC

Because of the hydrophilic nature of glycopeptides, hydrophilic interaction liquid chromatography (HILIC) on dextran/cellulose-based supports, such as Sepharose (53, 54), or zwitterionic resin (55) has been used to enrich glycopeptides and glycans. In HILIC, peptide mixtures are dissolved in a relatively hydrophobic (organic-rich) solvent, such as a mixture of 1-butanol, ethanol, and water (4:1:1, v/v), and applied to the column; the adsorbed glycopeptides are then eluted by decreasing the concentration of 1-butanol in aqueous ethanol. We coupled concurrent multiple lectin affinity chromatography with HILIC and significantly improved the purity of N-linked glycopeptides prepared from crude extracts of C. elegans (21) or mouse tissues (Kaji et al., unpublished).

Size Exclusion Chromatography

Among the in silico tryptic peptides of human proteins listed in the National Center for Biotechnology Information database, more than 90% are estimated to have masses smaller than 2,000 Da (56). On the other hand, masses of N-glycans are larger than 1,200 Da; thus, most N-glycopeptides that can be identified by MS/MS as deglycosylated forms could be enriched by size-exclusion chromatography, as demonstrated by Alvarez-Manilla et al. (56).
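
The mass argument can be illustrated with a small in silico digest. The following is a minimal sketch, not the procedure used in (56): it assumes the common rule that trypsin cleaves C-terminal to Lys/Arg except before Pro, treats the 1,200 Da lower bound as the mass added by the glycan (ignoring the water lost on glycosidic bond formation), and uses a hypothetical toy sequence built from the transferrin peptide CGLVPVLAENYNK shown in Fig. 4.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PEPTIDE_CUTOFF = 2000.0     # size threshold quoted in the text (Da)
MIN_N_GLYCAN_MASS = 1200.0  # lower bound for N-glycans quoted in the text (Da)


def tryptic_digest(sequence):
    """Cleave C-terminal to K or R, except when the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def fraction_below(peptides, cutoff=PEPTIDE_CUTOFF):
    """Fraction of peptides whose mass is below the cutoff."""
    return sum(peptide_mass(p) < cutoff for p in peptides) / len(peptides)


# Hypothetical toy sequence: an invented N-terminal stretch plus the Fig. 4 peptide.
toy_peptides = tryptic_digest("AAKPGGRCGLVPVLAENYNK")
bare = peptide_mass("CGLVPVLAENYNK")
glycosylated = bare + MIN_N_GLYCAN_MASS

print(toy_peptides, round(bare, 2), round(glycosylated, 2))
```

In this toy case both tryptic peptides fall below 2,000 Da, whereas the same peptide carrying even the smallest assumed N-glycan exceeds 2,600 Da, which is the separation that size exclusion exploits.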

Boronic Acid

Boronic acid forms cyclic boronate diesters through reaction with cis-diols, such as those present in sugar residues. Using this reactivity, boronic acids have long been used for glycoprotein enrichment. This method has recently been used to detect low-abundance glycoproteins in human blood samples (57).

Strong Cation Exchanger

Glycopeptides with a terminal sialic acid can be enriched by liquid chromatography (LC) on a strong cation exchanger column. By this method, Lewandrowski et al. (58) identified 148 glycosylation sites on 79 sialylated glycoproteins in a membrane fraction of human platelets.

Titanium Dioxide

It has been shown that a TiO2 column, developed originally to capture phosphopeptides, is also effective for enriching sialylated glycopeptides, probably because sialic acid forms a relatively stable bidentate bridge with the TiO2 ligand (59). The method was applied successfully to identify ∼100 sialo-glycoproteins each in human plasma and saliva.

Identification and Structural Analysis of Glycopeptides

Assignment of Glycopeptides

Peptide assignment in the shotgun analysis depends on tandem MS (MS/MS) analysis of the fragment ions generated by collision-induced dissociation (CID). Direct CID-MS/MS analysis of glycopeptides, however, generates preferentially a series of fragment ions derived from the dissociation of glycosyl bonds rather than peptide bonds, and thus, the glycan moiety needs to be removed before MS analysis for efficient peptide assignment. Deglycosylation has additional advantages: (1) The process increases the analysis sensitivity because it yields non-glycosylated peptides by removing multiple glycan forms having different chemical characteristics, and (2) the process allows the exploitation of stable isotope labeling of glycosylated sites on peptides to improve the fidelity of glycopeptide identification (60, 61). The stable isotope label not only indicates the formerly glycosylated site but also acts as a mass-tag for quantitative analysis (61).

Enzymatic or chemical processes are available for the deglycosylation reaction for large-scale analysis of glycopeptides (Fig. 2). PNGase releases N-linked glycans of glycopeptides, regardless of the glycan structure. The isozyme, PNGase F, is most widely used (61–63); however, PNGase A more effectively removes the N-linked glycans with core α(1,3)-fucose, which are found in N-glycoproteins synthesized in plants, insects, C. elegans, etc. Endo-β-N-acetylglucosaminidases (Endo D/H; EC 3.2.1.96) remove N-linked glycans from glycopeptides, leaving a single GlcNAc residue attached to the Asn (Fig. 2). In cases where proximal GlcNAc residues in the chitobiose core are modified with Fuc, the core-fucosylated proteins are identified by Endo D/H digestion after previous digestion of the glycopeptides with sialidase, β-galactosidase, and N-acetyl-β-glucosaminidase (55, 64).

Fig. 2
Identification of glycopeptides after deglycosylation. (1) Deglycosylation of N-glycan by PNGase and concomitant conversion of Asn to Asp that allows a glycosylation site-specific incorporation of 18O. (2) BEMAD. (3) Partial deglycosylation by Endo D/H treatment with exoglycosidases. Proximal GlcNAc remains on the modified Asn with/without core Fuc. Black diamond Sia, black circle Gal, black square GlcNAc, white circle Man, white triangle Fuc

O-glycans can be removed from O-linked glycopeptides through the alkaline β-elimination reaction. After β-elimination, several tags can be introduced at the glycosylated site by reacting the dehydro-intermediate with various thiol or amino compounds. Because the efficiency of the reaction depends largely on the structure of glycopeptides, the BEMAD reaction has been used mainly for O-GlcNAc site mapping (41, 42).

Direct Structural Analysis of Glycopeptides

Direct MS determination of glycan structure is extremely important for biomarker discovery, as the glycan structure attached to a particular glycoprotein is frequently and drastically changed during cell differentiation or tumorigenesis. High-speed multi-stage MS/MS technology, or MSn, has emerged as a technology that enables both peptide assignment and structural determination of posttranslational modifications, such as the site of glycosylation and the attached glycan structure (65–70). The method determines the structure by collecting information from a series of fragment ions generated by multiple CID processes, typically three times (MS3; Fig. 3). It has successfully determined the three glycosylated sites and the glycoform structures in Thy-1, a glycosylphosphatidylinositol-anchored protein (67), and determined the structures of neutral and sialylated N-glycans attached to chicken egg yolk glycopeptides (69). Thus, the MSn technology will be a powerful tool for glycoproteome analysis after the collection speed of MSn spectra is increased to the LC-MS time scale.
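
Because CID cleaves glycopeptides preferentially at glycosyl bonds, neighboring fragment ions often differ by the mass of one monosaccharide residue. The sketch below shows how such a fragment ladder might be read; it is an illustration under stated assumptions rather than the procedure of the cited studies. The function name, the mass tolerance, and the example ladder are hypothetical, and all ions are assumed to carry the same charge.

```python
# Monoisotopic residue masses (Da) of monosaccharides commonly found in glycans.
MONOSACCHARIDE_RESIDUES = {
    "Hex": 162.05282,     # e.g., Man, Gal
    "HexNAc": 203.07937,  # e.g., GlcNAc
    "dHex": 146.05791,    # e.g., Fuc
    "NeuAc": 291.09542,   # sialic acid (N-acetylneuraminic acid)
}


def annotate_ladder(fragment_mz, charge=1, tol=0.02):
    """Label gaps between consecutive fragment ions with monosaccharide names.

    Returns one label per gap; None marks a gap that matches no residue
    within the tolerance (in Da).
    """
    ordered = sorted(fragment_mz)
    labels = []
    for lower, upper in zip(ordered, ordered[1:]):
        delta = (upper - lower) * charge  # convert m/z spacing to mass
        match = next(
            (name for name, mass in MONOSACCHARIDE_RESIDUES.items()
             if abs(delta - mass) <= tol),
            None,
        )
        labels.append(match)
    return labels


# Hypothetical singly charged ladder: peptide+HexNAc, then +HexNAc, +Hex, +Hex.
base = 1622.80
ladder = [base, base + 203.07937, base + 406.15874, base + 568.21156, base + 730.26438]
print(annotate_ladder(ladder))
```

Isomeric residues (for example Man and Gal) have identical masses, so a ladder of this kind gives composition and sequence order but not linkage or stereochemistry; those require the additional fragment information collected in MSn.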

Fig. 3
Identification and structural analysis of glycopeptides. Glycopeptides eluted from LC are introduced into MS through an electrospray interface. CID of the precursor ion produces fragment ions dissociated preferentially at glycosyl bonds. The MS3 fragments of the peptide provide amino acid sequence information assigned by database search tools, and the MS3 fragments of glycan moieties provide structural information on the glycan attached to the glycopeptide. Alternatively, ECD/ETD of the glycopeptide precursor ion cleaves preferentially at peptide bonds and provides the amino acid sequence and a glycosylated amino acid with an additional mass of attached glycans. Thus, the detailed glycan structure can be determined by concurrent measurement with CID-MSn

Another approach to identify the core peptide of a glycopeptide is via electron-capture dissociation (ECD)/ETD technologies. ECD/ETD causes peptide bond cleavage without cleavage of labile bonds between the peptide and conjugated groups such as glycans and phosphates or within glycans (71–76). Although the application of this method has been limited to model glycopeptides, ECD/ETD coupled with CID has the potential for complete chemical structural determination of glycopeptides in complex biological mixtures.

Quantitative Glycoproteomics

Differential Analysis

Most technologies for quantitative proteomics are based on differential stable-isotope labeling, such as in vivo metabolic labeling of cultivated cells (77) and in vitro chemical labeling using ICAT (78), MCAT (79), iTRAQ (80, 81), and 13C, 15N-double labeled MCAT reagents (82). These can be used in combination with various technologies to capture glycopeptides. For instance, Zhang et al. (33, 83) estimated a quantitative difference in several mouse plasma glycoproteins before and after tumorigenesis by chemical capture of N-glycopeptides and stable isotope labeling (d0/d4) of the N-terminal amino group with succinimidation. Ueda et al. (84) identified 34 human plasma glycoproteins with different levels of α-1,6-fucosylation between lung cancer patients and healthy controls by stable isotope labeling of Trp residues with 2-nitrobenzenesulfenylation. On the other hand, two methods allow incorporation of a mass tag specifically into glycopeptides; one is BEMAD using deuterium-labeled DTT (42), and the other is PNGase-mediated incorporation of 18O specifically into Asn residues at N-glycosylated sites (61). We confirmed the feasibility of the latter approach and applied it to large-scale mouse glycoproteome analyses (Kaji et al., to be published, Fig. 4).

Fig. 4
Quantitative LC/MS/MS analysis of N-glycopeptides labeled by PNGase-mediated isotope-coded glycosylation site-specific differential labeling. Two N-glycopeptide samples are labeled differentially with PNGase in either \(H_2 ^{16} O\) (light: L) or \(H_2 ^{18} O\) (heavy: H), combined, and analyzed by LC/MS/MS. The molecular mass of a deglycosylated peptide increases by 1 Da (L) or 3 Da (H) from the calculated mass through the PNGase-mediated conversion of N-glycosylated Asn to Asp. The relative content of peptide in the two N-glycopeptide samples is estimated from the signal ratio between the “light” and “heavy” peptides, after correction of the signal overlaps caused by the natural abundance of isotopes. a Mass spectrum of a deglycosylated peptide of human transferrin, residues 421–433: CGLVPVLAENYNK, treated in \(H_2 ^{18} O\). b, c Mass spectra of a mixture of peptides treated with \(H_2 ^{16} O\) (L) and \(H_2 ^{18} O\) (H) at a ratio of 1:3 and 1:1, respectively
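
The 1 Da and 3 Da shifts and the overlap correction mentioned in the caption can be reproduced with a short calculation. This is a minimal sketch under simplifying assumptions, not the authors' software: the function names are hypothetical, the fraction of the light peptide's natural isotope envelope that falls 2 Da above its monoisotopic peak (`light_m2_fraction`) is taken as a given input, and incomplete 18O incorporation is not modeled.

```python
# Monoisotopic atomic masses (Da).
H = 1.00782503
N = 14.0030740
O16 = 15.9949146
O18 = 17.9991610


def deglycosylation_shift(heavy_water: bool) -> float:
    """Mass change when PNGase converts glycosylated Asn to Asp.

    The side-chain amide (-CONH2) becomes a carboxyl (-COOH): one N and one H
    are lost and one O, taken from the solvent water, is gained.
    """
    oxygen = O18 if heavy_water else O16
    return oxygen - N - H


def light_to_heavy_ratio(light_mono, heavy_observed, light_m2_fraction):
    """L/H ratio after removing the light peptide's natural M+2 peak.

    light_mono: intensity of the light monoisotopic peak.
    heavy_observed: intensity at the heavy monoisotopic position (2 Da higher).
    light_m2_fraction: assumed M+2/M intensity ratio of the light peptide alone.
    """
    heavy_corrected = heavy_observed - light_m2_fraction * light_mono
    if heavy_corrected <= 0:
        raise ValueError("no heavy signal remains after overlap correction")
    return light_mono / heavy_corrected


print(round(deglycosylation_shift(False), 4))  # shift in H2(16)O
print(round(deglycosylation_shift(True), 4))   # shift in H2(18)O
# Hypothetical intensities for a 1:3 mixture with an assumed M+2 fraction of 0.3.
print(light_to_heavy_ratio(100.0, 330.0, 0.3))
```

The computed shifts are about 0.984 Da and 2.988 Da, so the light and heavy forms are separated by about 2.004 Da. Without the correction, the hypothetical 1:3 mixture above would be read as 100:330 rather than 100:300.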

Target-Based Analysis

MS is widely used to quantify specific small molecules such as drugs, drug metabolites, hormones, etc., with excellent precision and sensitivity (85, 86). In these methods, a sample is introduced from LC through an ionizing spray, typically into a triple-quadrupole MS. Within the MS, the first mass analyzer is set to pass the target precursor molecular ion, rejecting components of other mass-to-charge ratios (m/z). The target molecule is then fragmented in a collision chamber by CID and passed to a second mass analyzer set to pass a known specific fragment. This two-stage selection affords great specificity and thus improves the signal-to-noise ratio and allows the quantification of target molecules by integrating the precursor ion signals eluted by LC. An internal standard, labeled with stable isotope, is often spiked into the sample to provide a reference to which the target molecule is compared. This technology, selected reaction monitoring (SRM) applied for MS-based quantitative analysis of small molecules, has been introduced to quantitative “shotgun” proteomics to measure a particular peptide in a crude sample with high selectivity and sensitivity (87–90). Thus, the quantification of target peptide derived from particular proteins, such as biomarker candidates, is performed by spiking a defined amount of an appropriate internal standard peptide (labeled with a stable isotope) into sample mixtures. Multiple reaction monitoring (MRM), a modification of SRM, measures the signal intensities of pre-listed ions or peak areas of both precursor ions and their CID-fragmented ions and thereby identifies and quantifies the target peptide accurately and precisely by comparison with reference peptide standards.
MRM has been applied to estimate the levels of five minor glycoproteins in human serum (91), to measure the levels of biomarker candidates in a mouse model of breast cancer (92), and to study the N-linked glycosylation reaction in congenital disorders of glycosylation type-I serum (93). Thus, MRM and SRM are extremely powerful for biomarker discovery and candidate validation (92). One of the obstacles to expanding the MRM approach to large-scale studies is the need to synthesize numerous reference peptides containing stable isotopes; however, this problem has been partly resolved by recent technological developments, such as QconCAT (94), in which many isotopically labeled peptides can be synthesized stoichiometrically as a concatenated precursor by expression of genetically engineered DNA in cells cultured in media containing stable isotope-labeled amino acids.
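
The internal-standard calculation behind SRM/MRM quantification can be sketched as follows. This is an illustrative example only: the function name, the peak areas, and the spiked amount are hypothetical, and the sketch assumes that the isotope-labeled standard and the endogenous peptide ionize and fragment with equal efficiency.

```python
def quantify_by_internal_standard(analyte_areas, standard_areas, spiked_amount):
    """Estimate the analyte amount from paired transition peak areas.

    analyte_areas / standard_areas: integrated LC peak areas for the same
    transitions of the endogenous peptide and its isotope-labeled standard.
    spiked_amount: known amount of standard added to the sample.
    Returns (estimated amount, per-transition area ratios).
    """
    if len(analyte_areas) != len(standard_areas) or not analyte_areas:
        raise ValueError("need the same non-empty set of transitions for both peptides")
    if any(area <= 0 for area in standard_areas):
        raise ValueError("standard peak areas must be positive")
    ratios = [a / s for a, s in zip(analyte_areas, standard_areas)]
    amount = sum(analyte_areas) / sum(standard_areas) * spiked_amount
    return amount, ratios


# Hypothetical peak areas for three transitions and 50 fmol of spiked standard.
amount, ratios = quantify_by_internal_standard(
    [1200.0, 800.0, 400.0], [600.0, 400.0, 200.0], spiked_amount=50.0
)
print(amount, ratios)
```

Agreement among the per-transition ratios is a useful sanity check: a transition whose ratio departs markedly from the others may be affected by a co-eluting interference.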

Conclusions

The MS-based glycoproteomics technologies described here allow identification of thousands of glycoproteins in plasma or crude cell extract and their sites of glycosylation and enable detection of quantitative changes associated with normal and aberrant cellular processes. These technologies contribute significantly to our understanding of protein glycosylation; however, the discovery of serodiagnostic biomarkers is still challenging and requires further improvements in the speed, resolution, and dynamic range of analysis. One promising approach to improve the dynamic range and to detect trace plasma components is to combine the glycopeptide capture methods described here with other analytical techniques to concentrate a particular subset of the proteome. For instance, Liu et al. (37) combined immunoaffinity subtraction of abundant serum proteins (e.g., albumin, immunoglobulin, and transferrin, which together comprise ∼90% of the total protein mass) with subsequent chemical fractionation based on cysteinyl peptide and N-glycopeptide capture. They identified 2,910 unique human plasma N-glycopeptides, corresponding to 662 N-glycoproteins and 1,553 N-glycosylated sites, and assigned numerous low-abundance plasma components, including 78 cytokines and cytokine receptors and 136 human cell differentiation molecules, e.g., interleukin-1, macrophage colony-stimulating factor, and tumor necrosis factor receptor-1, which are present at the nanograms per milliliter level in plasma. Thus, a proteome-wide “discovery-based” proteomics approach is coupled with a “target-based” approach, in which quantitative MS methods such as MRM are used to evaluate limited sets of candidate biomarkers in large sets of clinical samples.
Finally, it should be noted that technical advances in MS for comprehensive structural analysis of glycans attached to glycopeptides (95) are also critical for the discovery of new biomarkers because the structure of the glycan moiety is extremely sensitive to the state of cell differentiation and tumorigenesis.