1 Introduction

In 2019, a previously unknown severe acute respiratory disease broke out. The coronavirus responsible was soon isolated and sequenced by Chinese scientists [1,2,3], and was named SARS-CoV-2 by the World Health Organization because of its high sequence homology to the SARS virus [4, 5]. The resulting disease, COVID-19, became the first coronavirus pandemic in human history and caused tremendous losses worldwide, with over 161 million infections and 3.35 million deaths by May 15, 2021 [6].

SARS-CoV-2 is transmitted primarily through respiratory droplets, although other routes of dissemination and injuries to other organ systems have also been reported [7,8,9]. A typical β-coronavirus like the SARS and MERS viruses, it binds to angiotensin-converting enzyme II (ACE2) on host cells and then undergoes penetration, biosynthesis, and maturation in several organs [10,11,12,13]. It causes a multitude of symptoms, mainly dry cough and fever, among others [14]. Besides lower respiratory tract infection, other symptoms such as diarrhea and headache have also been observed [15, 16]. Some asymptomatic carriers remain symptom-free after 14 days of observation but still carry substantial amounts of the virus [17, 18]. Moreover, with the emergence of immune-escaping viral strains, COVID-19 might become a common health threat to humans, like seasonal flu [19,20,21]. It is therefore urgent to understand the association between molecular fingerprints and clinical outcomes [22,23,24].

Biomarkers are biomolecules that correlate with particular biological or disease states [25]. Viruses are known to alter the host proteome to achieve optimal growth [26]. The formation of protein interactions between a virus and its host is a determining factor in the adaptation of the virus to the host [27, 28]. More importantly, the abundance and modification status of immune response proteins directly reflect the immune status of the host [29, 30]. Owing to the crucial roles of proteins in biological and pathological processes, protein biomarkers have been identified for Zika, Ebola, swine flu, and other viral diseases [31,32,33]. Meanwhile, glycosylation is arguably one of the most important factors modulating protein location, binding affinity, activity, and fate [34]. Glycosylation also plays critical roles in the viral life cycle, including protein folding, receptor binding, adjusting degradation rates, controlling tropism, and shielding immunogenic epitopes from the immune system [35]. Virus-induced diseases significantly affect glycosylation in the host, changing the abundance and species of glycans, increasing or decreasing their expression, or even leaving glycan structures incomplete [36, 37]. The level and species of protein-linked glycans are closely related to the chemical status of the cellular microenvironment, particularly sugar levels and species that may be changed by disease, and have therefore long served as a source of disease biomarkers [38].

Around 20% of COVID-19 cases are reported to develop serious symptoms, some of which lead to death [39]. Although age, gender, pre-existing diseases, and unhealthy lifestyle are reported to significantly influence disease severity in COVID-19 patients, they cannot be used to forecast prognostic outcomes directly [40]. Extensive studies of COVID-19 pathogenesis and clinical features have made it clear that disease progression is closely related to the host’s immune response [41,42,43,44]. Pathogen infection fundamentally depends on exploiting host metabolism: viruses adapt to the host metabolic environment, begin to replicate and thrive, and continue to invade neighboring cells. Therefore, understanding how viruses exploit host metabolism may open a path to future therapies against them [45, 46]. A thorough and detailed analysis of the proteome and metabolome of COVID-19 patients should reveal important knowledge of host immune responses to infection and shed light on their potential links with clinical outcomes [47].

Meanwhile, mass spectrometry (MS)-based approaches, particularly MS-based proteomic and glycoproteomic analyses, are widely applied to explore the host response to pathogens, including infection, invasion, persistence, and pathogenesis, and can guide early diagnosis to prevent the disease from becoming severe [48, 49]. A unique strength of MS-based strategies is that they reveal an unbiased profile of the protein/metabolite status of patients in a high-throughput manner [50]. Extensive MS-based studies have searched for biomarkers by comparing longitudinal or cross-sectional molecular profiles of COVID-19 patients [22, 23, 51,52,53,54]. Herein, we focus on the current progress and knowledge of COVID-19 protein biomarkers, primarily those identified by MS-based proteomics approaches. We also discuss the potential use of glycoproteomics in the search for COVID-19 clinical biomarkers and the potential challenges in MS-based protein biomarker investigation.

2 MS-based Proteomics

It has been recognized since the early days of biological research that proteins play a fundamental role in life, encompassing metabolism, signal generation, transduction and regulation, immune response, molecular transport, structural organization, etc. [55,56,57,58,59]. The proteome is defined as the entire set of proteins translated from the genome [60]. As the systematic study of the diverse properties of the proteome [61], proteomics provides a global picture of biological processes at the protein level, including protein identity and abundance, modification status (i.e., post-translational modification, PTM), and protein–protein interactions, thereby indicating the current physiological and pathological state [62].

Over the past twenty or so years, proteomics has developed a multitude of methods for discovering disease biomarkers, driven by the rapid evolution of MS-based techniques such as matrix-assisted laser desorption/ionization (MALDI), electrospray ionization (ESI), Orbitrap, and Fourier transform MS [62]. Normally, MALDI-MS is applied to the analysis of relatively simple peptide samples, while LC–MS/MS is preferred for complex peptide mixtures. Two overall strategies are commonly applied to identify and characterize proteins in a particular sample: the bottom–up and top–down approaches. Top–down proteomics introduces intact proteins into the mass spectrometer and analyzes both the intact proteins and their fragments, which allows complete sequence coverage and full characterization of proteoforms but suffers from limited sensitivity and throughput [63, 64]. The most popular strategy for determining protein profiles, however, is shotgun proteomics, a bottom–up approach (Fig. 1). Shotgun proteomics first digests the protein mixture with enzymes and then separates the resulting peptides by LC. The peptides are then subjected to primary MS to obtain their mass-to-charge ratios and the corresponding signal intensities. Subsequently, a specific peptide ion is selected for fragmentation to obtain its MS/MS spectrum [65]. The choices for peptide fragmentation include higher-energy collisional dissociation (HCD), collision-induced dissociation (CID), and electron-transfer dissociation (ETD), which are well covered in other reviews and hence will not be discussed here [66, 67]. It is worth mentioning that ETD induces fragmentation by transferring electrons to a multiply protonated peptide and is therefore widely used to identify PTMs on peptides; in certain cases it is combined with HCD to identify modifications [68]. The raw data are then analyzed in software such as MaxQuant or Mascot to obtain peptide sequences and quantitation.
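
The first two steps of the shotgun workflow, enzymatic digestion and the measurement of peptide mass-to-charge ratios, can be illustrated with a minimal Python sketch. It assumes trypsin as the enzyme (cleavage after K or R unless followed by P, no missed cleavages), which the text does not specify, and uses standard monoisotopic residue masses. The function names and the example sequence are hypothetical and serve only as an illustration.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O added to the residue sum of a free peptide
PROTON = 1.007276   # mass of the charge carrier in positive-ion mode


def tryptic_digest(protein):
    """In silico digestion: cut after K or R unless the next residue is P.

    Assumption: trypsin with no missed cleavages (not stated in the text).
    """
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides


def peptide_mz(peptide, charge=1):
    """m/z of an unmodified peptide carrying `charge` protons."""
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral_mass + charge * PROTON) / charge


if __name__ == "__main__":
    # Hypothetical toy sequence, not a real protein.
    for pep in tryptic_digest("AKRPGKM"):
        print(pep, round(peptide_mz(pep, charge=2), 4))
```

The same peptide appears at a different m/z for each charge state, which is why the primary MS scan records both m/z and intensity before a precursor is chosen for fragmentation.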

Fig. 1

A workflow for COVID-19 protein biomarker identification using mass spectrometry (MS)-based proteomics. SARS-CoV-2 infects patients with different preconditions, causing symptoms of different severity. Urine, plasma or serum, and pharyngeal swab samples are collected from mild and severe patients and healthy controls. As an example, blood samples are subjected to several proteomic processing steps, including denaturation, thiol-alkylation, and digestion. MS spectra are compared and then subjected to multivariate statistical analysis, which can reveal protein biomarkers for different levels of COVID-19 severity

Quantitative proteomics methods can be divided into relative and absolute quantification [69]. The core of relative quantification is the comparison of differences: two or more samples under different physiological or pathological conditions are analyzed quantitatively by MS on a large scale and with high throughput to obtain precise differences in protein expression, mainly using stable isotope labeling or label-free techniques [70, 71]. Relative quantitation thus reports the fold change relative to a common baseline and is widely used to indicate the up- or down-regulation of proteins in response to stimuli. Absolute quantification, in contrast, determines the specific amount of an expressed protein: MS monitors a unique peptide of the target protein to obtain its peak area, which is compared with that of a known amount of standard peptide (external standard) or stable isotope-labeled peptide (internal standard) [72, 73].
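
The arithmetic behind the two quantification modes can be sketched in a few lines of Python. This is a simplified illustration rather than a validated procedure: it assumes a single-point calibration in which the target peptide and the spiked standard peptide give the same signal response, and the function names and units (fmol) are hypothetical.

```python
import math


def log2_fold_change(intensity_condition, intensity_baseline):
    """Relative quantification: log2 ratio of a protein's signal in one
    condition to its signal in the common baseline sample."""
    return math.log2(intensity_condition / intensity_baseline)


def absolute_amount(area_target, area_standard, standard_amount_fmol):
    """Absolute quantification against a standard peptide of known amount.

    Assumption: equal response of target and standard peptide, so the
    peak-area ratio equals the amount ratio (single-point calibration).
    """
    return area_target / area_standard * standard_amount_fmol


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    print(log2_fold_change(4.0e6, 1.0e6))        # positive = up-regulated
    print(absolute_amount(3.0e5, 1.5e5, 50.0))   # amount in fmol
```

In practice a multi-point calibration curve is usually preferred over a single ratio, but the principle of comparing the target peak area with that of a known standard is the same.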

However, when choosing a quantitation method, a more important question is whether to focus on a limited group of candidates (targeted quantitation) or to use an unbiased approach that explores as many candidates as possible (non-targeted quantitation, Fig. 2). Traditionally, biomarker discovery follows an “unbiased–biased–validation” route [74], because unbiased (non-targeted) quantitation combined with profiling proteomics provides the whole picture, and the significantly changed candidates can then be explored further in a larger set of clinical samples by biased (targeted) quantitation. In the case of COVID-19 biomarker discovery, given that the race against the virus is still ongoing, omitting one step to accelerate the process is reasonable. A variety of quantitation approaches are available, depending on which route is taken: targeted or non-targeted.

Fig. 2

A general quantitation method for biomarker identification. Quantitative approaches for biomarker discovery can generally be divided into non-targeted and targeted methods. Non-targeted proteomics allows a systematic and comprehensive analysis of the proteins in a sample and is generally used as the first step in biomarker discovery. Depending on whether a labeling reagent is used, non-targeted proteomics can be further divided into two categories. The most widely used labels are isobaric tags for relative and absolute quantitation (iTRAQ) and tandem mass tags (TMT), while data-independent acquisition (DIA) and spectral counting are common label-free quantitation methods. Targeted quantitation, in contrast, is a biased strategy that focuses on a small set of biomarker candidates. Common targeted methods include the well-established multiple reaction monitoring (MRM) and selected reaction monitoring (SRM), and the newly emerging parallel reaction monitoring (PRM), which provides a full MS2 spectrum to confirm the identity of the target in addition to quantitation information

Stable isotope labeling is a classical method and probably still the most common approach for non-targeted quantitation; it includes isobaric tags for relative and absolute quantitation (iTRAQ) and tandem mass tags (TMT) [75, 76]. A detailed workflow is shown in Fig. 3 using iTRAQ as an example. Label-free methods such as spectral counting provide an alternative. However, all three methods rely on data-dependent acquisition (DDA), in which precursor ions are chosen for MS2 analysis based on signal intensity. DDA is known to be prone to losing low-abundance peptides, involves a degree of randomness, and yields an uneven number of scans [77, 78]. The more recently developed data-independent acquisition (DIA) takes a different approach: it divides the whole scanning range into several windows, selects and fragments each window in turn, and collects all fragment ions of all precursor ions within the window [79]. DIA is restricted neither to specific target peptides nor by an upper limit on throughput, and it yields a uniform number of scanning points. DIA can therefore achieve both qualitative confirmation and quantitative ion screening, which gives it advantages over traditional DDA [80]. For targeted quantitation, multiple reaction monitoring (MRM) serves as the gold standard for many biomarkers, while the newly emerged parallel reaction monitoring (PRM) provides an MS2 spectrum in addition to quantitation information and thereby increases the reliability and specificity of quantitation [81, 82].
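
The difference between the two acquisition modes can be made concrete with a small Python sketch. It is a conceptual illustration only: the top-N selection rule, the fixed window width, and all names and peak values are assumptions chosen for clarity, and real instruments add features such as dynamic exclusion and variable or overlapping windows.

```python
def dda_select_precursors(ms1_peaks, top_n=3):
    """DDA: pick the top_n most intense precursors from one MS1 scan.

    ms1_peaks is a list of (mz, intensity) tuples.
    """
    ranked = sorted(ms1_peaks, key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _ in ranked[:top_n]]


def dia_windows(mz_start, mz_end, width):
    """DIA: split the scan range into contiguous isolation windows."""
    windows, low = [], mz_start
    while low < mz_end:
        high = min(low + width, mz_end)
        windows.append((low, high))
        low = high
    return windows


def dia_assign(ms1_peaks, windows):
    """DIA: every precursor inside a window is fragmented, whatever its
    intensity."""
    return {
        window: [mz for mz, _ in ms1_peaks if window[0] <= mz < window[1]]
        for window in windows
    }


if __name__ == "__main__":
    # Hypothetical MS1 scan: two intense and several weak precursors.
    peaks = [(410.2, 9e5), (455.7, 2e3), (502.3, 5e5),
             (618.9, 7e4), (733.4, 1e3)]
    print(dda_select_precursors(peaks, top_n=3))
    print(dia_assign(peaks, dia_windows(400, 800, 100)))
```

In this toy scan the weak precursors at m/z 455.7 and 733.4 are never selected by the intensity-ranked DDA step, whereas each falls into a DIA window and is fragmented together with its neighbors.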

Fig. 3

A workflow for iTRAQ-based protein quantification. iTRAQ-based protein quantification (4-plex in this case) builds on qualitative analysis: after protein extraction, thiol-alkylation, and digestion, the resulting peptides are labeled with iTRAQ reagents and mixed, followed by liquid chromatography separation and tandem mass spectrometry analysis. A database search of the peptide fragments allows identification of the labeled peptides and thus the corresponding proteins. The reporter ions released upon fragmentation are used to quantify the peptides and the proteins from which they originate
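
The reporter-ion step of the 4-plex workflow can be sketched as follows. The sketch assumes nominal reporter m/z values of 114 to 117 matched within a small tolerance, uses channel 114 as the reference, and rolls peptide ratios up to the protein level by taking the median; the tolerance, the reference channel, the roll-up rule, and all names are illustrative assumptions rather than a prescribed procedure.

```python
from statistics import median

# Nominal reporter ion m/z values for the four iTRAQ 4-plex channels.
REPORTER_MZ = {"114": 114.1, "115": 115.1, "116": 116.1, "117": 117.1}


def reporter_intensities(ms2_peaks, tol=0.05):
    """Sum the MS2 intensity found within +/- tol of each reporter m/z.

    ms2_peaks is a list of (mz, intensity) tuples from one MS/MS spectrum.
    """
    return {
        channel: sum(i for mz, i in ms2_peaks if abs(mz - target) <= tol)
        for channel, target in REPORTER_MZ.items()
    }


def channel_ratios(intensities, reference="114"):
    """Peptide-level ratios of every channel to the reference channel."""
    ref = intensities[reference]
    return {channel: value / ref for channel, value in intensities.items()}


def protein_ratios(peptide_ratio_list):
    """Protein-level ratio per channel as the median over its peptides
    (one possible roll-up rule, assumed here)."""
    return {
        channel: median(ratios[channel] for ratios in peptide_ratio_list)
        for channel in REPORTER_MZ
    }


if __name__ == "__main__":
    # Hypothetical MS/MS spectrum: four reporter ions plus one fragment ion.
    spectrum = [(114.11, 1000.0), (115.11, 2000.0), (116.11, 500.0),
                (117.12, 1000.0), (175.12, 300.0)]
    print(channel_ratios(reporter_intensities(spectrum)))
```

Because the labeled peptides from all four samples are isobaric at the MS1 level, the sample-specific information only appears in these low-mass reporter ions after fragmentation.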

Body fluids are considered the most promising material in the search for disease biomarkers because of their direct contact with various tissues, which secrete or shed proteins into them [83]. Back in 2006, Mann et al. constructed the first protein maps of human body fluids, using high-precision, high-resolution MS for large-scale analysis of a variety of body fluids [84,85,86]. In 2014, Kim et al. presented two large-scale drafts of the human proteome map, using high-resolution Fourier transform MS on body fluids (including serum, saliva, urine, etc.) [87]. Among all body fluids, plasma is the most widely used for biomarker profiling, largely because it influences other body fluids to a certain extent and can be obtained by simple, non-invasive methods [88, 89]. Plasma is extremely complex, and the dynamic range of plasma protein concentrations, estimated at over 10 orders of magnitude, is the highest among all body fluids; high-abundance proteins therefore need to be depleted to reduce background complexity before MS analysis [89]. Commonly used methods for processing plasma samples include antibody-based depletion followed by fractionation; sometimes target proteins need to be isolated specifically [90].

Gordon et al. used plasma samples to systematically investigate the host factors interacting with SARS-CoV-2 while exploring potential pharmacological compounds that inhibit viral replication [91]. To better understand the interaction profile between SARS-CoV-2 and host cell proteins, they tagged and expressed 26 viral proteins. Because interaction profiling requires immunoprecipitation-based purification techniques, affinity purification MS was applied to identify host proteins that directly interact with each viral target, resulting in the identification of over 300 protein–protein interactions. Ultimately, this led to the discovery of potential antiviral drugs, which may give rise to a therapeutic regimen against COVID-19. A similar investigation by Bojkova et al. [51] encourages the development of translation inhibitors to prevent virus replication. Several phosphoproteomics studies were also conducted by different research groups to identify key kinases that respond to infection [52,53,54]. All these studies provide clues for biomarker identification in patient samples.

We searched the literature indexed in Web of Science using COVID-19, proteomics, and biomarker as keywords (Table 1). Except for the study by Wallentin et al. [92], which obtained a large number of samples directly from hospitals with support from the ARISTOTLE trial, the majority of the studies used plasma as the sample and had cohort sizes of no more than 50 cases. The protein biomarkers uncovered were mainly related to inflammation caused by the disease. For example, Shen et al. sought potential blood biomarkers for COVID-19 severity assessment [23]. They performed proteomic profiling of sera from 46 COVID-19 patients and 53 healthy subjects, each sampled at no more than 2 time-points. To obtain more accurate relative protein quantification, the study employed a stable isotope-labeled proteomics strategy (TMT) on an Orbitrap instrument in DDA mode. A total of 93 differentially expressed proteins were finally identified in the sera of severe patients. The study by Shu et al. also focused on the host response in COVID-19 pathophysiology [22]. To investigate the immune response associated with distinct clinical outcomes, and thus uncover protein markers associated with disease progression and tissue-specific protein alterations, this study performed proteomic analysis of 22 clinically diagnosed COVID-19 patients versus 8 healthy controls, with up to 4 time-points per subject, using TMT-assisted LC–ESI–MS/MS in DDA mode. Validation by machine learning and enzyme-linked immunosorbent assay (ELISA) ultimately identified several altered plasma features. Intriguingly, the plasma proteins with significant alterations included CRP as well as acute-phase proteins (APPs), cholesteryl ester transfer protein (CETP), peptidase inhibitor 16 (PI16), etc. The results share common features with the respective studies by Shen et al. [23] and Gordon et al. [91], which implies a pathological association with inflammation and suggests proteinase receptors as pharmacological targets. Although the reviewed studies using DDA mode achieved promising results, DIA mode is considered superior to DDA in accuracy and reproducibility owing to its unfiltered detection of peptide mixtures, and could be deployed as an improvement.

Table 1 Summary table of SARS-CoV-2 biomarkers studies

3 Mass Spectrometry-based Glycoproteomics

3.1 The Overview and Techniques of Glycoproteomics

The glycome refers to the entire complement of carbohydrates produced by cells or tissues, which comprise varied sequences and are conjugated to proteins and lipids to form glycoproteins and glycolipids, respectively [93]. Glycomics is the systematic study of the comprehensive intra- and extracellular glycome under specific spatiotemporal conditions and environments, and reveals the cellular processes governed by interactions among glycans, proteins, and lipids [94, 95]. In many cases, bacteria and viruses infect the host via carbohydrate–carbohydrate interactions, which is also relevant to the ongoing COVID-19 pandemic [95,96,97].

Glycoproteomics is of great importance in the identification of bacteria and viruses as well as in immunology and pharmaceuticals, and is an emerging frontier in biomarker discovery. Although many glycoproteomics studies have examined the glycosylated spike (S) protein and the mechanism by which it mediates infection through host cell receptors, no biomarker applicable to pathology and drug development has yet been discovered [98,99,100,101,102]. However, previous glycoproteomics studies of SARS and MERS suggest that glycans are not trivial modifications on the surface of SARS-CoV-2 and deserve further research [103, 104]. To profile the glycosylation status in COVID-19 patient plasma, the workflow proceeds through enrichment, digestion, MS analysis, and data acquisition, in that order. The enrichment strategy must match the goal of the experiment, namely whether a certain protein or a broad group of proteins is to be enriched. With a clear experimental target, a range of methods can be applied for glycoprotein fractionation and enrichment, including differential centrifugation, two-phase separation, biotinylation, etc. [105,106,107,108]. Glycoproteomics analysis can be performed using different techniques, with structural characterization of the glycoprotein as the basis and release of the glycans from their protein carrier as the key step. Unlike the uniform linkage of the peptide backbone, the N- and O-linkages of monosaccharides and glycosaminoglycans (GAGs) require a correspondingly chosen release method. For example, for N-glycans, which share a common core structure (Man3GlcNAc2), Peptide-N-glycosidase F can cleave the bond between asparagine and the core GlcNAc of the N-glycan [109]. Chemical methods, e.g., hydrazinolysis [110] and β-elimination [111], can also be deployed to release the N-glycans, but are not as efficient as enzymatic cleavage. For the complete release of O-glycans, in contrast, no single enzyme is sufficient because of the diversity of Ser/Thr-linked glycans, so chemical methods are preferred.

As in proteomics, MS has become a major tool for the analysis of both glycans and glycopeptides. Compared with other means (e.g., nuclear magnetic resonance, exoglycosidase treatment, or lectin analysis), MS-based glycan analysis can yield a tremendous amount of structural information: MS detects glycan components, while MS/MS or multistage MS can elucidate details of positional and linkage isomers [112, 113]. MALDI and ESI are the primary ionization methods [114], and LC–ESI–MS offers high sensitivity, low ion suppression, and resolution of positional and linkage isomers [115]. To homogenize the physicochemical properties of the glycan pool and thus reduce ion suppression, glycans are often derivatized, for example by stable isotope permethylation [116]. Glycoproteomics analysis by MALDI has the advantages of easy sample preparation, automation, fast data acquisition, and the ability to observe single ions [117]. Moreover, MALDI combined with derivatization techniques such as methylation avoids the loss of acidic groups during ionization, making it the most widely used method for qualitative analysis. As the detailed methodology for glycan and glycopeptide analysis has been well reviewed, particularly regarding its application to SARS-CoV-2 [118], we do not cover experimental considerations and design in detail in this mini-review.

3.2 Current Glycoproteomics Discovery Status

Glycosylation of viral envelope proteins is known to be one way in which viruses mask the epitopes of associated proteins [119]. The pathogen uses glycosylation to evade recognition by the host immune system, making itself undetectable, possibly interfering with the host's adaptive immunity, and even enhancing its own infectivity [120, 121]. Several pathogenic viruses, including HIV, influenza virus, SARS, and Zika virus, are known to use host-derived glycosylation events to build both N-linked and O-linked glycoproteins on the virion surface in order to infect their target host cells [119]. As in other coronaviruses, the differential organization of glycosylation in SARS-CoV-2 affects not only the composition of individual glycans but also the immune pressure on the entire viral protein surface, thereby exposing vulnerable areas in the dense carbohydrate layer of the surface [95, 103].

Glycoproteins play prominent roles not only in the viral surface envelope but also in receptor binding. This role helps explain the mechanism of cross-species transmission of some coronaviruses and provides clues for tracing the origin of the virus. Hulswit et al. reported that the glycosylated 9-O-Ac-Sia-specific receptor-binding site on β1-coronavirus binds to the glycoprotein of the receptor [122]. Qing et al. also reported two receptor-binding sites in coronaviruses: S1A binds to host sialic acid, while S1B recognizes host transmembrane proteins, thus potentially enabling the shift from zoonotic infection to human-to-human transmission [123]. However, owing to the relatively low abundance of glycosylation events and the technical challenge of enriching and analyzing glycopeptides, all current findings are still at an early stage in terms of biomarker discovery.

The inevitable challenges in investigating viral glycoproteins are the macro-heterogeneity and micro-heterogeneity that result from the mixture of diverse glycosylation at multiple sites of a protein [124, 125]. Macro-heterogeneity concerns the occupancy, i.e., the presence or absence, of glycans at glycosites, while micro-heterogeneity refers to the variety of glycans at a specific glycosite [125]. Both forms of heterogeneity significantly influence the physical and biochemical properties of proteins [125, 126]. N- and O-glycans also exhibit distinct heterogeneity. The diversity of N-glycan structures leads to high micro-heterogeneity and low macro-heterogeneity at specific sites [127, 128]. Variations in macro-heterogeneity are also evident for mucin-type O-glycosylation [129]. Mucin-type O-glycosylation usually shows high peptide-level valence and micro-heterogeneity: multiple O-glycosylation sites may form in close proximity, but the glycans at each site may differ [129, 130].

Although MS-based glycoproteomics is capable of analyzing the heterogeneity of all classes of glycosylation, the complexity of glycoforms and the variability in the chemical properties of glycoproteins relative to non-modified proteins make enrichment a necessary step before MS analysis [131,132,133,134]. Enrichment isolates glycans and glycoconjugates from the non-glycosylated background, thus greatly improving the sensitivity of MS analysis. Affinity chromatography, the most common enrichment strategy, uses specific biochemical interactions between analytes and immobilized ligands to enrich substances of interest from background matrices [135, 136]. Lectins are one class of affinity ligand, owing to their recognition of carbohydrates and their ability to separate glycans and substrates with varying degrees of specificity [137]. Lectins commonly used for glycoprotein enrichment include concanavalin A, wheat germ agglutinin, Ricinus communis agglutinin, galectins, and siglecs, and they are usually combined with supporting materials such as agarose or polystyrene-divinylbenzene [137, 138]. Two other types of affinity chromatography, immobilized metal affinity chromatography (IMAC) and metal oxide affinity chromatography (MOAC), offer greater advantages in enriching negatively charged glycans. Both were adapted, with refinements, from phosphoproteomics [139, 140]. IMAC uses transition metal cations (e.g., Fe3+, Ga3+, Ti4+, Zr4+) chelated to an immobilized substrate, whereas MOAC uses transition metals in a metal oxide matrix (e.g., TiOx) [141, 142]. Both techniques exploit their affinity for deprotonated carboxyl groups to achieve enrichment and are particularly effective for sialylated glycopeptides [140,141,142,143]. Hydrophilic interaction chromatography (HILIC) is another important tool for glycoprotein enrichment and characterization and can be applied to a wide range of biological samples, such as biological fluids, cancer systems, pathogens, and plants [144,145,146,147]. HILIC exploits the hydrophilic nature of glycans: owing to the difference between the semi-aqueous mobile phase and the hydrophilic stationary phase, glycopeptides are enriched as they partition from the organic loading buffer into the hydrophilic environment [147,148,149]. The recently emerged porous graphitic carbon (PGC), a chromatographic material with both polar and hydrophilic retention properties, overcomes the disadvantages of silica-based stationary phases and has proven effective for the separation and enrichment of glycopeptides and glycans [150,151,152]. In addition, hydrazide chemistry, which has been improved over the years, can selectively react with glycan derivatives and, combined with release by PNGase F, has become one of the methods for glycoprotein enrichment [153]. However, there is so far no universal enrichment strategy for glycoproteomics. Different approaches can be adapted to the glycans of interest, which means that the experimental design must start from a practical goal and rely on experimental data to establish which approach is applicable.
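
Once glycopeptides have been enriched and identified, the two kinds of heterogeneity defined earlier in this section can be summarized numerically for each glycosite. The Python sketch below treats macro-heterogeneity as site occupancy and micro-heterogeneity as the relative distribution of glycoforms among occupied observations. It assumes simple counting of identified peptide observations (intensity-based weighting is equally possible), and the function names and example data are hypothetical.

```python
from collections import Counter


def site_occupancy(observations):
    """Macro-heterogeneity: fraction of observations of one glycosite
    that carry any glycan. `None` marks an unoccupied observation."""
    occupied = sum(1 for glycan in observations if glycan is not None)
    return occupied / len(observations)


def glycoform_distribution(observations):
    """Micro-heterogeneity: relative share of each glycan composition
    among the occupied observations of one glycosite."""
    glycans = [glycan for glycan in observations if glycan is not None]
    counts = Counter(glycans)
    return {glycan: n / len(glycans) for glycan, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical observations for a single N-glycosite.
    site = ["Man5GlcNAc2"] * 6 + ["Man9GlcNAc2"] * 2 + [None] * 2
    print(site_occupancy(site))
    print(glycoform_distribution(site))
```

A site with occupancy close to 1 but many glycoforms would correspond to the low macro-heterogeneity and high micro-heterogeneity described above for N-glycans.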

Glycoproteomics requires much higher overall throughput, data quality, and accessibility for complete glycopeptide identification than conventional proteomics, posing new challenges for both algorithms and search engines. Analyzing intact glycopeptides often requires combining multiple sample processing strategies, different MS/MS fragmentation methods, and various software tools, which affects the throughput and quality of MS acquisition [154,155,156]. Owing to the lack of comprehensive quality control, search engine matches for glycans, peptides, and glycopeptides are all prone to high false discovery rates (FDR) and to a lack of validation of spectral interpretation [154]. To address these limitations, Liu et al. [157] developed a new MS acquisition method and a specialized search engine. By optimizing the MS/MS collision parameters, the acquisition method can capture comprehensive fragments of intact glycopeptides in a single spectrum. The search engine, named pGlyco 2.0, takes full advantage of these fragments in a spectrum and thus controls the quality of glycopeptide-spectrum matches (GPSMs). Daniel et al. [158] developed an MSFragger-based glycoproteomics search engine, MSFragger-Glyco, which can search N- and O-glycopeptides quickly and sensitively, more than doubling the number of identifications relative to the original searches. Although glycoproteomics still faces many challenges, continuing advances in technology are driving it to create great value in many fields, including virology and pharmacology.
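
To make the notion of an FDR for spectrum matches concrete, the sketch below uses the widely applied target-decoy strategy, in which matches to reversed or shuffled decoy sequences estimate the number of false target matches. This is a generic illustration and is not a description of how pGlyco 2.0 or MSFragger-Glyco compute their FDR; the simple decoys/targets estimator, the function names, and the scores are assumptions.

```python
def fdr_at_threshold(matches, threshold):
    """Estimated FDR among matches scoring at or above `threshold`.

    matches is a list of (score, is_decoy) tuples. The estimate is the
    number of decoy matches divided by the number of target matches.
    """
    targets = sum(1 for s, decoy in matches if s >= threshold and not decoy)
    decoys = sum(1 for s, decoy in matches if s >= threshold and decoy)
    return decoys / targets if targets else 0.0


def score_cutoff(matches, max_fdr=0.01):
    """Lowest score threshold whose estimated FDR is at most max_fdr,
    or None if no threshold qualifies."""
    best = None
    for score in sorted({s for s, _ in matches}, reverse=True):
        if fdr_at_threshold(matches, score) <= max_fdr:
            best = score
    return best


if __name__ == "__main__":
    # Hypothetical scored matches: (score, is_decoy).
    matches = [(9.0, False), (8.0, False), (7.0, False), (6.0, False),
               (5.0, True), (4.0, False), (3.0, True)]
    print(score_cutoff(matches, max_fdr=0.01))
```

For intact glycopeptides the same idea has to be applied to the glycan part, the peptide part, and their combination, which is one reason quality control is harder than in conventional proteomics.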

3.3 Glycoproteomics Characterization of SARS-CoV-2 Spike Protein

Many studies have shown that SARS-CoV-2 invades cells through a densely glycosylated S protein [98,99,100,101,102]. A trimeric class I fusion protein, the S protein consists of two subunits, S1 and S2, which are generated by proteolytic cleavage [98, 100]. S1 contains the receptor-binding domain (RBD) and is decisive for receptor recognition, while S2 is responsible for membrane fusion and is essential for cell adhesion and immune protection [159]. When S1 binds to the ACE2 receptor of the host cell, it is shed from the S protein, allowing the virus to fuse with the host cell membrane via S2 [98, 160]. Interestingly, the S protein of SARS-CoV-2 has also been reported to be a major target of neutralizing antibodies [104]. Hence, it remains a critical question whether glycosylation of coronavirus S proteins allows adequate exposure of viral protein epitopes or plays a fundamental role in immune evasion.

Emerging experimental evidence, as well as bioinformatic analysis, has pointed out that spike protein is heavily glycosylated. According to sequence features, S protein accommodates at least 22 potential N-glycosylation sites (hence with each trimer presenting 66 N-linked glycosylation sites) and at least 3 mucin-type O-glycosylation sites [161,162,163,164,165]. Eight sites (17, 61, 74, 122, 149, 165, 234, 282) are in the N-terminal domain of the S protein, two (331 and 343) in the RBD, two (1098 and 1134) in the connector domain, and the other sites are outside the functional domains [163]. Three O-glycosylation sites are located within the S1 subunit at Ser673, Thr678, and Ser686 residues [161, 165]. Notably, Shajahan et al. identified an unreported O-glycosylation site at Thr323 and another possible one at Ser525 of the RBD, which may play a key role in viral binding to its cellular receptor ACE2 [166]. Sanda et al. identified eight additional O-glycopeptides near the furin cleavage site of the spike glycoprotein for the first time [165]. These O-glycosylation sites are thought to protect S protein epitopes or key residues from immune system attack and have an important contribution to the immune escape of viruses [161, 166, 167]. Whether all glycosylation sites are glycosylated simultaneously and constantly, or affected by the host environment and other factors, remains an open question to be explored.
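
Potential N-glycosylation sites are typically predicted from sequence by searching for the N-X-S/T sequon, where X is any residue except proline. The Python sketch below shows such a scan together with the per-trimer arithmetic quoted above. The example sequence is a hypothetical toy string, not the spike protein sequence, and a sequon only marks a potential site: whether it is actually occupied has to be determined experimentally.

```python
import re

# N-X-S/T sequon with X != P. The lookahead allows overlapping sequons
# (e.g. "NNST") to be reported individually.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence):
    """Return 1-based positions of Asn residues in N-X-S/T sequons."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence)]


def sites_per_trimer(sites_per_protomer, protomers=3):
    """Total potential sites displayed by a homotrimer."""
    return sites_per_protomer * protomers


if __name__ == "__main__":
    toy_sequence = "MNGTANPSNNSTK"  # hypothetical, for illustration only
    print(find_sequons(toy_sequence))
    print(sites_per_trimer(22))      # 22 sequons per protomer, as in the text
```

In the toy sequence the asparagine followed by proline is skipped, which reflects why not every N-x-S/T-like motif counts as a potential glycosite.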

The glycan composition at each glycosylation site is another critical issue for understanding the virus–host interaction. Mass spectrometry-based glycoproteomics has shown that the glycans on the spike protein are mainly high-mannose, hybrid, and complex types [162,163,164,165,166, 168]. Specifically, eight N-glycosylation sites are predominantly high-mannose-type and the other sites carry primarily complex glycans [163, 166]. The two most prominent high-mannose-type sites (containing more than 80% high mannose) on the S protein are N234 and N709 [163]. Except at N234, where Man9GlcNAc2 dominates, the major high-mannose-type glycan structure at the other seven sites is Man5GlcNAc2 (Man, mannose; GlcNAc, N-acetylglucosamine), making these sites susceptible as substrates for α-1,2-mannosidases but not for reaction with GlcNAcT-I, and thus unable to be processed into the hybrid- and complex-type glycans [163]. Moreover, the glycan composition and occupancy of the respective sites may be different when S1 and S2 are expressed separately [166]. What factors contribute to this divergence and how they affect molecular function remain open questions.
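The 80% criterion mentioned above can be expressed as a simple per-site calculation. The following is a minimal sketch, assuming that relative abundances of the three glycan classes are available for each site; the function names, the site labels, and all abundance values are hypothetical and do not reproduce data from the cited studies.

```python
from typing import Dict


def high_mannose_fraction(abundance: Dict[str, float]) -> float:
    """Fraction of the total glycan signal at one site that is high-mannose."""
    total = sum(abundance.values())
    if total <= 0:
        raise ValueError("site has no glycan signal")
    return abundance.get("high_mannose", 0.0) / total


def classify_site(abundance: Dict[str, float], threshold: float = 0.8) -> str:
    """Label a site by whether high-mannose glycans exceed the threshold."""
    if high_mannose_fraction(abundance) > threshold:
        return "predominantly high-mannose"
    return "mixed or complex"


# Hypothetical abundances (arbitrary units) for two made-up sites.
example_sites = {
    "site_A": {"high_mannose": 90.0, "hybrid": 4.0, "complex": 6.0},
    "site_B": {"high_mannose": 10.0, "hybrid": 15.0, "complex": 75.0},
}

labels = {name: classify_site(values) for name, values in example_sites.items()}
for name, label in labels.items():
    print(name, label)
```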

Heterogeneity in the glycosylation of the SARS-CoV-2 spike protein has been reported [163, 164, 166]. The species and quantities of N-glycans on the S protein differ when it is isolated from different host cells [164]. Zhang et al. predicted that native N-glycosylation processing of the S protein of SARS-CoV-2 produces mature glycans that should be identical to those of recombinant proteins expressed in human cells [164]. Miller et al. evaluated the glycosylation-related heterogeneity of the N-glycosylation sites and showed heterogeneity in the degree of glycan processing of S proteins, with some trimers being more processed than others. This heterogeneity may play a role in confusing the host immune system [162]. Shajahan et al. characterized quantitative N-glycosylation profiles of S proteins and applied an extensive manual interpretation strategy to enrich the data on N- and O-glycosylation, confirming the complexity of glycosylation in SARS-CoV-2 [166]. Furthermore, several studies indicated that the glycosylation sites on the S protein have remained relatively conserved during the rapid global spread [161, 167, 169, 170]. This conservation points to their critical functions in the virus life cycle and host adaptation. The binding of the S protein to the ACE2 receptor is mainly due to the interaction of polar residues between the RBD and the structural domain of ACE2 [11, 99, 169]. Trimeric S proteins with complex glycosylation have an advantage over monomeric and immaturely glycosylated variants in receptor binding [171]. The hinge-like dynamic movement of the RBD on the S protein increases the affinity of the RBD for ACE2 by up to 10–20 times, which partly explains the high viral transmissibility [98, 99, 172]. In contrast, if the biosynthesis of N-glycans is blocked at the oligomannose stage, or if the synthesis of O-glycans is blocked, breakdown of the spike protein increases, reducing the possibility of viral binding to ACE2 [173].

4 Limitations and Perspectives

To extend the understanding of the SARS-CoV-2 infectious mechanism and pathogenesis in support of more rapid and sensitive diagnosis, we reviewed MS-based proteomics and glycoproteomics and their application in finding COVID-19 biomarkers in body fluids of patients at different stages or with different preconditions. In the face of the aggressive outbreak of COVID-19, it is urgent to develop speedy and precise diagnosis of SARS-CoV-2 so that appropriate medical measures can be deployed at an early stage of infection. Since the outbreak of the COVID-19 pandemic, LC–MS-based proteomics and glycoproteomics have been used primarily to find biomarkers to identify SARS-CoV-2, validate drug targets, assess medical efficacy, or elucidate molecular mechanisms of pathogenesis and disease severity [174]. Although SARS-CoV-2 shows signs of mutation, and the complexity of the pathogenesis and symptoms of COVID-19 will change accordingly, MS-based omics techniques still offer great advantages in helping to understand the prevalence of COVID-19 [175].

A major challenge is the limited number of cases used in biomarker studies. Given the great variation in patient demographics (age, gender) and pre-existing diseases, such as immunosuppression, chronic renal insufficiency, obesity, and diabetes, a larger number of samples should be collected to eliminate the effects of these biases and to reveal true clinical biomarkers. In a study of 1099 confirmed COVID-19 patients, 23.7% (261) of patients were diagnosed with at least one comorbidity, demonstrating the relevance of pre-existing conditions to COVID-19 [176]. Another challenge is the sampling time of the patients. Due to individual differences, the time of disease onset may vary from patient to patient; consequently, when comparing samples, we may be comparing patients at different stages of disease progression, which makes the results harder to interpret. Therefore, both cross-sectional and longitudinal analyses of patient samples are needed for a thorough understanding of disease progression. Ideally, a combined approach comparing different patients at the same stage and the same patient at different stages of COVID-19 would greatly advance our understanding of which protein(s) determine disease progression.

In proteomics, contemporary MS has developed tremendously, and its resolution can reach up to 100,000 [177]. Even so, the dynamic range of MS detection is still limited for the profiling of complex samples. For example, to analyze plasma proteins from COVID-19 patients, the challenges posed by the large range of plasma protein concentrations need to be overcome. Commonly deployed strategies include (I) removal of high-abundance proteins, especially albumin and immunoglobulins in plasma, to avoid masking less abundant ones, (II) fractionation of plasma proteins by chromatography, gel electrophoresis, or other means to reduce complexity, and (III) separation of target groups of proteins or peptides of interest using strategies such as ELISA. Proteins may contain multiple glycan modification sites, and the glycan chain type (O- or N-glycan) and glycan chain occupancy may differ at each site (macro-heterogeneity), while multiple different glycan chain structures may be present at a single site (micro-heterogeneity) [178]. The macro- and micro-heterogeneities of glycan chain structures need to be resolved one by one in glycoprotein structural analysis, so establishing efficient, highly specific, and sensitive methods for the enrichment and analysis of glycosylation modifications at scale is the key to the in-depth study of glycosylation.
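To make the distinction between macro- and micro-heterogeneity concrete, the sketch below computes site occupancy (a macro-heterogeneity measure) and the relative distribution of glycoforms at a site (a micro-heterogeneity measure). It assumes a simplified input in which each identified peptide covering a site is recorded either with a glycan composition or as unoccupied (`None`); the function names and the observation list are hypothetical and are not taken from the cited studies.

```python
from collections import Counter
from typing import Dict, List, Optional


def site_occupancy(observations: List[Optional[str]]) -> float:
    """Macro-heterogeneity: fraction of observations at a site carrying a glycan."""
    if not observations:
        raise ValueError("no observations for this site")
    occupied = sum(1 for glycan in observations if glycan is not None)
    return occupied / len(observations)


def glycoform_distribution(observations: List[Optional[str]]) -> Dict[str, float]:
    """Micro-heterogeneity: relative share of each glycan among occupied copies."""
    counts = Counter(glycan for glycan in observations if glycan is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {glycan: n / total for glycan, n in counts.items()}


# Hypothetical observations at one site: None marks an unoccupied peptide.
observations = ["Man5GlcNAc2"] * 6 + ["Man9GlcNAc2"] * 2 + [None] * 2

occupancy = site_occupancy(observations)
distribution = glycoform_distribution(observations)

print(occupancy)      # share of peptides at this site that are glycosylated
print(distribution)   # relative abundance of each glycoform at this site
```

In practice, spectral counts are only a rough proxy for abundance, so intensity-based quantification would likely replace the simple counting used here.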

At a time when COVID-19 is ravaging the world, rapid detection of the virus is an essential strategy for controlling the spread of the disease. Multi-omics approaches to biomarker discovery and to the investigation of altered molecular networks are powerful and valuable tools for a more comprehensive overview of COVID-19. In particular, proteomics and the emerging field of glycoproteomics can provide gene expression and post-translational information that, when applied to the search for biomarkers of COVID-19, can contribute to a deeper exploration of infection and pathogenesis at the molecular level, and ultimately uncover therapeutic strategies. MS-based proteomics and glycoproteomics have the advantage of being high-throughput and unbiased, and are exceptionally promising for opening up key strategies to defeat the COVID-19 pandemic.