Introduction

A rare disease (RD) affects a small number of people [1]. According to the Rare Disease Act of 2002, a RD “affects fewer than 200,000 people in the USA” [2], whilst in Europe, a disease is considered to be rare when it affects no more than one person in 2000 [3]. An operational definition would consider the specific clinical and qualitative challenges associated with the low prevalence of rare diseases. Rare Diseases International is working with a multi-stakeholder group of experts to develop an international operational definition of rare diseases as part of its Memorandum of Understanding with the World Health Organization [4].

Close to 8000 RDs are known and about 80% have a genetic origin [5, 6]. Hence, when jointly considered, RDs are not so rare. A recent analysis yielded a conservative, evidence-based estimate for the population prevalence of rare diseases of 3.5–5.9%, corresponding to 263–446 million persons affected globally at any point in time [7]. One in 17 people will be affected by a RD at some point in their lives [8]. This amounts to 30 million people across Europe and 25 million in the United States of America (USA) [9]. Despite the many efforts of the medical, research, and patient communities, most RDs lack effective treatment [10].

This article focuses specifically on congenital disorders of glycosylation (CDG), first reported clinically in 1980 by Jaeken et al. [11]. These are rare genetic disorders caused by pathogenic variants in the genes that code for proteins needed for the glycosylation and deglycosylation processes (building glycan trees and attaching them to proteins and lipids). Approximately 170 different CDG have been described [12, 13]. They are named by the official gene symbol followed by the suffix “-CDG”. They can be classified according to the underlying affected glycosylation pathways: protein N-glycosylation, protein O-glycosylation, glycosphingolipid synthesis, glycosylphosphatidylinositol (GPI)-anchor synthesis, and other glycosylation pathways. In N-glycosylation, glycan assembly occurs in the cytosol and the endoplasmic reticulum, and remodelling in the Golgi apparatus. O-glycosylation mainly occurs in the Golgi apparatus. It involves no processing and thus consists only of assembly, allowing for a more variable set of O-glycans. In some CDG, there is a combined defect in N- and O-glycosylation. Glycosphingolipids consist of membrane lipids linked to a glycan. GPI-anchored proteins are localised in the plasma membrane [14,15,16]. N-glycosylation defects can be divided into CDG type I (CDG-I) and CDG type II (CDG-II). CDG-I defects impair the biosynthesis and attachment of the lipid-linked oligosaccharide to proteins, thus generating proteins which have some unoccupied glycosylation sites. CDG-II defects impair glycan remodelling, thus generating defective glycoproteins [17].

According to the Centers for Disease Control and Prevention [18], epidemiology consists of the methods used to understand the causes and progression of diseases and health outcomes in populations. It is the study of the distribution and determinants of these diseases and outcomes, and its findings can be used to control health problems and their implications for society.

Two important epidemiologic tools are prevalence and incidence: Prevalence refers to the ratio between the number of patients with a particular disease in a population and the number of people in that population at a given moment. Incidence refers to the number of individuals who develop a specific disease during a particular time period. The key difference between these two epidemiologic measures is that incidence only considers new cases, while prevalence includes new and pre-existing cases. These terms are often confused with “frequency”, which is the number of events in a certain time [18].
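The distinction between the two measures can be made concrete with a short calculation. The following is a minimal sketch: the function names and the numbers are hypothetical and chosen only for illustration, and incidence is expressed here as a cumulative incidence (new cases divided by the population at risk), which is one common convention rather than the only one.

```python
def prevalence(all_cases: int, population: int) -> float:
    """Point prevalence: new plus pre-existing cases divided by the population at one moment."""
    if population <= 0:
        raise ValueError("population must be positive")
    return all_cases / population


def cumulative_incidence(new_cases: int, population_at_risk: int) -> float:
    """New cases arising during a period divided by the population at risk in that period."""
    if population_at_risk <= 0:
        raise ValueError("population_at_risk must be positive")
    return new_cases / population_at_risk


def per_100_000(proportion: float) -> float:
    """Rescale a proportion to cases per 100,000 people."""
    return proportion * 100_000


# Hypothetical numbers, chosen only for illustration (not taken from any study).
population = 1_000_000
all_cases = 50   # everyone living with the disease today (new and pre-existing)
new_cases = 5    # people diagnosed for the first time during the year

print(f"prevalence: {per_100_000(prevalence(all_cases, population)):.2f} per 100,000")
print(f"incidence:  {per_100_000(cumulative_incidence(new_cases, population)):.2f} per 100,000 per year")
```

In this invented example the prevalence is ten times the yearly incidence, simply because pre-existing cases accumulate in the numerator of prevalence but not of incidence.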

Therefore, generating epidemiological data on CDG, and other RDs, is fundamental to:

  1. Evaluate the burden of disease

  2. Identify unmet clinical needs. For instance, epidemiological data contributes to creating guidelines for patient management and follow-up. Clinical management guidelines are essential to ensure that data is systematically collected and that patients receive uniform, high-quality care [7]

  3. Expedite drug approval, by supporting the designation of new drugs as “orphan drugs”: One criterion for a drug to obtain an orphan drug designation by the Food and Drug Administration (FDA) [19] and European Medicines Agency (EMA) [20] is adequate documentation on relevant epidemiological data (surveys, cohort studies, patient registries, databases, etc.) demonstrating that the intended condition is rare

  4. Identify eligible target populations for therapies prior to being marketed, since relevant epidemiological data is a limiting factor for clinical trial development

This review aimed to assemble and summarise published epidemiological data on CDG and to report the main obstacles to research in this area.

Methods

A set of keywords related to epidemiology and CDG was defined (Table 1S). A customised Python script (tj_articles_extraction - https://github.com/tatianarijoff/tjbioarticles) was used to combine the keywords and search the MEDLINE database, using PubMed as the search engine. For each article, the script retrieves the corresponding MEDLINE data (title, abstract, MeSH terms, etc.) from the XML and exports them to a Pandas DataFrame. Using the LaTeX Python library, we generated PDF files containing the title, year of publication, and abstract of the articles selected by the script and eliminated duplicate entries.
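The two mechanical steps described here, combining keywords into queries and eliminating duplicate entries, can be sketched as follows. This is not the tjbioarticles code: the function names, the example keywords, and the record layout (a dictionary with a `pmid` key) are assumptions made for illustration, and the sketch performs no network access.

```python
from itertools import product


def build_queries(disease_terms, epidemiology_terms):
    """Pair every disease keyword with every epidemiology keyword as a boolean AND query."""
    return [f'("{d}") AND ("{e}")' for d, e in product(disease_terms, epidemiology_terms)]


def deduplicate(records):
    """Keep the first record seen for each PMID, preserving the original order."""
    seen = set()
    unique = []
    for record in records:
        pmid = record["pmid"]
        if pmid not in seen:
            seen.add(pmid)
            unique.append(record)
    return unique


# Hypothetical keywords; the actual list is the one given in Table 1S.
queries = build_queries(
    ["congenital disorders of glycosylation", "PMM2-CDG"],
    ["prevalence", "incidence", "epidemiology"],
)
print(len(queries), "queries, e.g.", queries[0])

# Hypothetical records: the same article can be returned by several queries.
hits = [
    {"pmid": "1001", "title": "Article A"},
    {"pmid": "1002", "title": "Article B"},
    {"pmid": "1001", "title": "Article A"},
]
print(len(deduplicate(hits)), "unique records out of", len(hits))
```

Because every keyword pair is searched separately, the same article is typically returned several times, which is why a deduplication step keyed on a stable identifier is needed before screening.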

We identified 628 article abstracts, from which 115 unique abstracts were extracted.

The inclusion and exclusion criteria were as follows: articles had to (1) include information regarding incidence, prevalence, number of patients, or epidemiology; (2) be related or relevant to CDG; (3) be written in English; and (4) not be reviews, opinion articles, or other types of articles such as short/brief communications and letters. The flowchart for study selection is shown in Fig. 1. Of the 115 unique records selected, sixty-three articles were reviewed. An additional 101 articles were identified through authors’ referrals and screening of the extracted reviews’ references.

Fig. 1

PRISMA flowchart of study selection. *Some articles were excluded for more than one reason

We extracted information from the 165 included articles regarding their study aim, disease(s) addressed, incidence, prevalence, number of patients in the total population, and significant findings used in the results section of this article. Several cohort studies involving patients from different countries could not be taken to represent the total national population. As some patients may have been reported in different studies, only studies that explicitly stated that they report the total number of patients per country were considered. Seaborn (https://seaborn.pydata.org/, a Python data visualisation library) [21, 22] and Excel were used to generate the figures used in this manuscript.

Results

Characteristics of the included studies

The 165 articles reviewed fall into five groups: (i) 81 case reports; (ii) 43 patient cohort studies which reported on the frequency of CDG symptoms [23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63]; (iii) 11 papers [38, 39, 42, 64,65,66,67,68,69,70,71] on the frequency of carriers of specific pathogenic variants, most of European origin, and prevalence of CDG; (iv) 9 papers [65, 72,73,74,75,76,77,78,79] on pathogenic variant allelic frequency, mostly describing PMM2 pathogenic variants; and (v) 8 other original articles, which did not focus on epidemiological data, but on other data (mechanisms of disease; 3D modelling of the proteins associated with CDG, etc.), reporting information relevant to the present work [62, 80,81,82,83,84,85,86]. Three questionnaires were also included in this review [25, 26, 69].

These 165 studies reported data from 38 different countries, with the largest number of publications (n = 31) from the USA, followed by Belgium (n = 15), France and the Netherlands (n = 12 for each), and Germany (n = 11) (Fig. 2). Only the most recent data was included in this literature review.

Fig. 2

The geographical distribution of the publications included in the literature review. Articles whose corresponding authors were from different countries were counted for each country

Of the 170 different described CDG, 93 were included in this review. The most used method to screen for/diagnose CDG was transferrin isoelectric focusing (TfIEF) (n=11) [20, 23, 31, 36,37,38,39, 42, 45, 49, 50], followed by whole exome/genome sequencing (WES/WGS) (n=6) [27, 35, 38, 39, 69, 75], polymerase chain reaction (PCR) (n=5) [64, 71, 72, 74, 76], Sanger sequencing (n=3) [24, 27, 35], sequence-specific primers (SSP) [64, 76], ELISA [71], single nucleotide polymorphisms (SNP) and short tandem repeat (STR) genotypic analysis [74], and/or other studies [30, 36, 67, 72, 78, 87]. CDG prevalence in the total population was calculated mostly by dividing the total number of reported patients by the size of the total population, whilst disease frequency based on allele frequency was mostly assessed using the Hardy-Weinberg equilibrium. Seven studies presented an estimated prevalence that was determined with a 95% confidence interval (CI) [38, 42, 47, 68, 71, 72, 88]. Most of the studies covered data regarding CDG patients in Europe, as most reported patients had a European origin.
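A prevalence obtained by dividing reported patients by population size can be accompanied by a 95% CI in several ways. The reviewed studies' CI methods are not detailed here, so the sketch below uses the Wilson score interval as an assumption; it is a common choice for small proportions, not necessarily the method those studies applied. The function name and the numbers are hypothetical.

```python
import math


def wilson_interval(cases: int, population: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    if population <= 0 or not 0 <= cases <= population:
        raise ValueError("need population > 0 and 0 <= cases <= population")
    p = cases / population
    denom = 1 + z**2 / population
    centre = (p + z**2 / (2 * population)) / denom
    half_width = z * math.sqrt(p * (1 - p) / population + z**2 / (4 * population**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical example: 20 reported patients in a population of 2,000,000.
cases, population = 20, 2_000_000
low, high = wilson_interval(cases, population)
print(f"point estimate: {cases / population * 1e5:.2f} per 100,000")
print(f"95% CI: {low * 1e5:.2f} to {high * 1e5:.2f} per 100,000")
```

With few reported patients the interval is wide relative to the point estimate, which is worth keeping in mind when comparing prevalence figures between small national cohorts.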

Summary of demographic and clinical characteristics

Table 1 summarises the total number of CDG patients in the included studies. A total of 3057 CDG patients were considered. Additionally, a comparison between the reported CDG patients and a systematic survey of literature carried out by Orphanet in January 2022 is presented [89]. In Fig. 1S, the data presented in Table 1 is organised in a pie-chart format.

Table 1 Reported number of CDG patients by type

Figure 3 shows the number of reported CDG patients in different countries by continent. Some case reports did not report the patients’ nationality but their ethnicity. Europe was the continent with the highest number of reported CDG patients (n=618), followed by Asia (n=416), America (n=243, approximately 190 from the USA), and Africa (n=22). Additionally, two XYLT2-CDG Australian patients [167] were considered as well. Japan had the highest number of reported patients, with approximately 215 (including 207 FKTN-CDG patients), followed by the USA (n=188), Italy (n=110), France (n=103), Spain (n=97), and Saudi Arabia (n=68).

Fig. 3

Number of reported CDG patients by continent until October 2021. The patients are divided by country and ethnic groups. The USA axis includes patients of different ethnicities. Delgado et al. [48] did a cohort study on Argentinian and Chilean EXT1/EXT2-CDG patients but did not specify the patients’ nationality. Geis et al. [49] reported 35 POMT1-CDG patients from 27 independent families (16 families of Turkish origin; 8 of German origin; and, individually, 1 with Indonesian, Gipsy, and African origins). The authors did not report the nationality of the patient with African parents

The data from 521 patients collected from case reports is summarised in Table 2S. A total of 161 of them had their age at diagnosis available. It ranged from prenatal diagnosis [51] to 47 years [132]. The average age at diagnosis for these patients was 86.8 months (± 7.2 years).

In their patient cohort, Pérez-Cerdá et al. observed that the CDG patients born before 1996 had an average age at diagnosis of 13.4 ± 4.3 years, whilst in the patients born after 1995, it was 2.3 ± 2.4 years [41].

Four studies reported the mean or range of age at diagnosis [25, 29, 47, 93], showing a propensity for young patients. For instance, Francisco et al. [25] reported that 74% of CDG patients were diagnosed before 5 years of age, and in 2000, 54% of diagnosed patients were below 10 years of age [168]. Whilst 37.9% of non-PMM2-CDG patients were diagnosed after six years of age, only 17.2% of PMM2-CDG patients were diagnosed in the same age range [25]. Witters et al. [29] and van de Loo et al. [93] reported the same age range for the diagnosis of PMM2-CDG and other CDG, respectively: between 2 weeks and 21 years.

The mean age at diagnosis of EXT1-CDG was 3 years [47]. Except for X-linked CDG, no difference was observed between the incidence of CDG in male and female patients [43, 54, 157].

The mortality rate in infants with PMM2-CDG was 20% [25, 36]. A similar rate (18%) was observed in Portuguese PMM2-CDG patients aged 1 to 12 months [65].

Besides providing data on incidence, fourteen articles also shared information about the clinical manifestations and disease progression [23,24,25,26, 28,29,30, 32, 33, 35, 36, 42, 93, 169]. Six of these articles reported symptoms and disease progression of PMM2-CDG patients.

Epidemiological data

From the overall number of reported CDG patients in Table 1, the five most frequently found CDG are PMM2-CDG (32.7% of the total number of CDG patients included in this review), followed by FKTN-CDG (6.5%), EXT1/EXT2-CDG (3.7%), ALG6-CDG (3.3%), and PIGA-CDG (2.9%).

The total number of diagnosed European CDG patients might exceed 2500, making the prevalence of CDG in Europe around 0.1–0.5:100,000 [87].

The birth incidence of PMM2-CDG ranges from 5:100,000 to 0.06:100,000 births worldwide [68, 71]. The frequency of PMM2-CDG in patients under 18 years old has been estimated in the Estonian population to be 1:79,000 [38]. The estimated frequency of PMM2-CDG ranged from 50:1,000,000 [71] in the total Dutch, Flemish, and Danish populations to 0.4:1,000,000 in Poland [68]. The expected carrier frequency of PMM2-CDG in the Saudi Arabian and Turkish populations was considerably lower than in European populations (1.4×10⁻⁵ and 3.5×10⁻⁶, respectively) [42, 69] (Table 2).

Table 2 Prevalence of different CDG reported in diverse populations, based on allele frequency and total population

MPI-CDG, ALG9-CDG, COG5-CDG, COG6-CDG, and COG8-CDG are extremely rare in European Americans and were predicted to be considerably more prevalent in African Americans [169]. Despite that, a total of thirty-six MPI-CDG patients have been reported.

In the following subsections, we provide more detailed information on seventeen CDG: PMM2-CDG, FKTN-CDG, EXT1/EXT2-CDG, ALG6-CDG, PIGA-CDG, GALNT3-CDG, SLC35A2-CDG, ST3GAL5-CDG, DPAGT1-CDG, GMPPA-CDG, ALG12-CDG, ALG8-CDG, GALNT14-CDG, GALNT5-CDG, MPDU1-CDG, RFT1-CDG, and SRD5A3-CDG.

PMM2-CDG

Twenty-two publications exclusively reported epidemiologic data on PMM2-CDG [31, 35,36,37,38, 42, 46, 71, 74,75,76,77,78, 88, 105, 168, 189,190,191,192,193,194]. PMM2-CDG accounts for most of the published CDG patients, reaching 62% in 2018 [87]. There are at least 1000 reported PMM2-CDG patients, but the number of diagnosed patients is undoubtedly much higher [38].

The most common compound heterozygous pathogenic genotype is P113L/R419H [29], and the most frequent pathogenic variant is R141H, a missense variant [25, 29, 35, 39, 46, 71, 72, 74, 75, 77, 87, 168, 195] (Table 3).

Table 3 Allelic frequency of the heterozygous R141H missense pathogenic variant

The geographic distribution and frequency of the different pathogenic variants varied greatly in the European countries, except for the frequent R141H variant. The latter was reported from all countries and ranged from 20.5% (in Portugal, 2021, based on allelic frequency) [65] to 86.4% (in Denmark, 2000, estimation based on haplotype analysis) [71] of all pathogenic variants (Table 3). The high incidence of R141H in European individuals might be due to preferential transmission (transmission ratio distortion) and a geographic origin event [74]. Evidence for a transmission ratio distortion was indeed found: there is an increased recurrence risk in PMM2-CDG families ranging from 38% in Scandinavian families to 34% in other European families (2004) [77]. On the other hand, haplotype analysis of patients with this variant and from different geographic origins suggested that R141H is an ancestral variant in the Caucasian population, representing an old and single event derived from linkage disequilibrium with other alleles [71].

The R141H/R141H genotype has never been reported. This can be explained by its lethality, as the enzymatic activity of the recombinant R141H protein is null [77].

A founder effect for the missense F119L variant, the second most common pathogenic variant amongst the Scandinavian population, was found in Southern Scandinavia [72, 168]. Denmark, Sweden, the Netherlands, and Belgium, mentioned in descending order, reported the highest number of patients per number of inhabitants. In Denmark, F119L accounted for 48% of the disease alleles. Moving southward, this percentage gradually decreased: F119L varied between 17% in the Netherlands and 11% in Germany. This pathogenic variant has not been found in Southern European countries [168]. Preferential transmission was also observed for F119L amongst Scandinavian patients; this was suggested to be related to a reproductive advantage at the stage of gametogenesis, fertilisation, implantation, or embryogenesis [77].

D65Y is an Iberian founder missense pathogenic variant [65]. Alongside R141H, it is the most prevalent pathogenic variant found in Portugal and Spain (19.3% and 18% of PMM2-CDG patients, respectively) [65, 74].

Additionally, V44A, a missense pathogenic variant, probably also originated in the Iberian Peninsula, as it has only been reported in Portuguese and Latin American patients [78].

By 2000, the E139K missense variant had been reported in four French patients, and it is likely a founder pathogenic variant of French origin [168].

The Italian population’s second most common pathogenic variant is the L32R missense pathogenic variant (16% of disease alleles). Besides L32R and R141H variants, 14 others were found [168].

The missense pathogenic variant V231M was commonly found amongst Polish and Estonian PMM2-CDG patients (21% and 23%, respectively) [38, 68].

Amongst the Turkish population, V231M was the most prevalent pathogenic variant (2020). The low prevalence of PMM2-CDG in the Turkish population can be explained by the fact that the R141H allele is rare in the Turkish population (3.81×10⁻⁴ vs 3.97×10⁻³ in public databases). The total frequency of likely pathogenic alleles in the Turkish population was 1.91×10⁻³, and it was considerably lower than allele frequencies reported before in public databases. It should be pointed out that the rate of consanguineous marriages in Turkey is high (23%), increasing the risk for homozygous variants [42].

Most of the pathogenic PMM2-CDG variants have been reported in Europe. This might be explained by PMM2-CDG having been studied less intensively on other continents.

FKTN-CDG (FCMD)

FKTN-CDG, also known as Fukuyama congenital muscular dystrophy (FCMD), has a high incidence in Japan (0.7–1.2 per 10,000 births) [60]. It is caused by a founder pathogenic variant in the 3′-UTR of the fukutin gene (FKTN) within the Japanese population [121, 196]. A Japanese national registry reported 207 patients by 2013 [55].

EXT1/EXT2-CDG

EXT1/EXT2-CDG is one of the few dominant CDG. It is also known as hereditary multiple exostoses (HME) or hereditary multiple osteochondromas (HMO); reports published prior to the new nomenclature proposal [154] would therefore not be identified through our set of keywords.

More than 110 patients have been reported [48, 58]. The estimated incidence of EXT1/EXT2-CDG patients ranges between 5.56×10⁻⁵ and 2×10⁻⁵. The highest estimated incidence has been identified in Pauingassi, a Canadian Indian community (1:77) [180], followed by a closed sub-population of the Chamorros population from the island of Guam (1:1000) [57]. In Europe and North America, the estimated incidence is 1:50,000 (2×10⁻⁵) [114]. EXT1 variants account for 56 to 78% of all EXT1/EXT2-CDG patients, whilst variants in the EXT2 gene account for 20.5 to 44% [58].
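The incidences in this subsection are quoted in two notations, as a ratio (1:N) and as a decimal frequency. A minimal sketch of the conversion between them follows; the helper names are hypothetical, and only figures quoted in the paragraph above are used as inputs.

```python
def ratio_to_frequency(n: float) -> float:
    """Convert a '1:n' ratio into a decimal frequency."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 1 / n


def frequency_to_ratio(frequency: float) -> int:
    """Convert a decimal frequency into the n of a '1:n' ratio, rounded to the nearest integer."""
    if not 0 < frequency <= 1:
        raise ValueError("frequency must be in (0, 1]")
    return round(1 / frequency)


# Figures quoted in the text above.
for label, n in [("Europe and North America", 50_000), ("Guam sub-population", 1_000), ("Pauingassi", 77)]:
    print(f"{label}: 1:{n:,} = {ratio_to_frequency(n):.2e}")

# The upper bound of the quoted general range, expressed as a ratio.
print(f"5.56e-05 = 1:{frequency_to_ratio(5.56e-5):,}")
```

Expressing all values in one notation makes it easier to see that the two isolated communities lie orders of magnitude above the general estimates.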

ALG6-CDG

Three publications exclusively reported epidemiologic data on ALG6-CDG [67, 197, 198]. One hundred and one patients with ALG6-CDG have been reported, making it the second most frequent CDG-I, following PMM2-CDG [87]. The frequency and the prevalence of ALG6-CDG in the global population are not known. Almost all reported ALG6-CDG patients were found in Europe [197]. Some patients were diagnosed in South Africa as descendants of European colonists [198].

Half of the ALG6-CDG patients showed the homozygous A333V missense variant. Twenty different ALG6 pathogenic variants have been described, with the c.998T variant resulting in the A333V substitution making up most of the alleles [64]. Another frequent variant is the missense L453V variant, with an allelic frequency of 0.012 [73].

ALG6 missense variant Y131H occurs at a frequency of 0.021 in the general North American population. Based on the allelic frequency, the birth rate of homozygotes is predicted to be 4.55×10⁻⁴. It was suggested that if this homozygous variant causes CDG, it may not be detected by serum TfIEF and thus could be missed [67].
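The step from an allele frequency to an expected homozygote frequency follows from the Hardy-Weinberg equilibrium, under its usual assumptions (random mating, no selection, and so on). A minimal sketch, with a hypothetical function name, using the rounded allele frequency quoted above:

```python
def hardy_weinberg(q: float) -> dict:
    """Expected genotype frequencies for a variant allele of frequency q (reference allele p = 1 - q)."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    p = 1 - q
    return {
        "homozygous_reference": p * p,
        "carrier": 2 * p * q,          # heterozygotes
        "homozygous_variant": q * q,
    }


# Rounded Y131H allele frequency as quoted in the text.
genotypes = hardy_weinberg(0.021)
print(f"carriers:             {genotypes['carrier']:.4f}")
print(f"homozygous variant:   {genotypes['homozygous_variant']:.2e}")
```

With the rounded value q = 0.021, q² comes to about 4.4×10⁻⁴, close to but not identical with the 4.55×10⁻⁴ quoted above; the difference presumably reflects an unrounded allele frequency in the original calculation, although that is an assumption.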

PIGA-CDG

PIGA-CDG, also known as multiple congenital anomalies-hypotonia-seizures syndrome type 2 (MCAHS2), is an X-linked recessive disease [199]; thus, symptoms have only been reported in male patients [43]. As of 2020, 88 patients have been reported [59].

GALNT3-CDG

GALNT3-CDG is also known as familial hyperphosphatemic tumoral calcinosis (HFTC). At least 66 patients from 42 different families have been reported [83, 125, 126].

The incidence is highest in patients of African origin, followed by white patients with Middle Eastern origin. One Chinese GALNT3-CDG patient has been reported [125].

SLC35A2-CDG

According to our literature search, 62 SLC35A2-CDG patients have been reported [54, 153]. It is an X-linked dominant disease, with a more aggressive phenotype in male patients (and a higher mortality rate in males than in females) but a higher incidence in females than in males. The frequency and the prevalence of SLC35A2-CDG are not known. Seven per cent of the CDG-II patients have been diagnosed with SLC35A2-CDG [54].

ST3GAL5-CDG

Fifty ST3GAL5-CDG patients have been reported; 38 of them are Amish (the largest Amish populations are found in the states of Pennsylvania, Ohio, and Indiana) and carry a founder pathogenic nonsense variant (c.862C > T, R288X), with a predicted incidence of 1:1200. The reported non-Amish patients have been found inside and outside the USA: 4 African American, 3 Pakistani, 2 French, 2 South Korean, and 1 Iranian [56].

DPAGT1-CDG

Forty-five DPAGT1-CDG patients have been reported and around 46% (13/28) are female (Table 2S). In the cohort study by Ng et al., the mortality in infants was 30%, making DPAGT1-CDG more severe than PMM2-CDG, in which 20% of the patients with the most severe symptoms died before the age of 1 [63]. Most of the pathogenic variants associated with DPAGT1-CDG are missense (Table 2S). Pérez-Cerdá et al. have reported eight Spanish patients [41].

GMPPA-CDG

GMPPA-CDG has been identified in 21 patients. The homozygous GMPPA frameshift pathogenic variant L89fs is likely a founder variant originating from the Maya-Mam population (Guatemala, Central America) [132].

Other CDG

In Table 4, we clustered information on frequent pathogenic variants of other CDG. It should be clarified that the pathogenic variants grouped in Table 4 do not represent all the reported pathogenic variants, only those whose allelic frequency was documented in the articles reviewed in this literature review.

Table 4 Allelic frequency of pathogenic variants of ALG1, ALG12, ALG2, ALG3, ALG8, ALG9, DDOST, GALNT14, GALNT15, MPDU1, RFT1, and SRD5A3 genes

Discussion

We found 165 papers with epidemiological data on CDG, focusing mainly on incidence, prevalence, and allelic frequency.

The prevalence and carrier frequencies of most CDG are unknown. Few articles provided information on the incidence and prevalence of CDG, as there is only information regarding the prevalence of PMM2-CDG for nine countries (Denmark, Estonia, Flanders (Belgium), France, Netherlands, Poland, Saudi Arabia, Sweden, and Turkey) [38, 42, 68, 71, 72, 88]; EXT1/EXT2-CDG for Bulgaria, Pauingassi (Canadian indigenous people), Chamorros (indigenous people of Guam), the UK, and the USA [47, 57, 70, 114, 180]; CDG-I for Sweden [72]; CDG (non-specified) for Poland, Saudi Arabia, and the USA [68, 69, 188]; and the birth prevalence for FKTN-CDG in Japan [60]. Although 42 cohort studies were identified, none reported patients from sub-Saharan African countries, Russia, and China. As these are huge countries, a high underdiagnosis of CDG is expected.

Excluding Japan, with more than 210 patients, 207 with FKTN-CDG [55], the country with the highest number of reported CDG patients was the USA [188]. This may be due to the existence of research consortia (such as Frontiers in Congenital Disorders of Glycosylation [200]) and internationally referenced clinics, facilitating both the diagnosis and a closer follow-up of these patients. Besides being a large country, the USA is ethnically diverse. Therefore, a large portion of the population might originate from regions with a high prevalence of some CDG (either by a founder effect or a high carrier frequency of certain pathogenic variants).

Although some 160 CDG-causing genes associated with about 210 distinct clinically defined phenotypes have been reported [12, 13], we present epidemiological data on only 93 CDG. For the remaining CDG, the number of patients is too small or the reported information is too scarce for an epidemiological study [44, 85, 86, 97, 112, 119, 120, 133, 155, 157]. This underlines CDG heterogeneity, patient geographical dispersion, and the likelihood that many patients are still unreported.

Regarding the most frequent CDG, PMM2-CDG, Denmark, Netherlands, and Flanders (Belgium) are the regions with the highest prevalence of PMM2-CDG described so far (1:20,000) [71]. The European country with the highest number of reported PMM2-CDG patients is France (n=96) [40], followed by Spain (n=71) [41], Portugal (n=39) [65], Italy (n=37) [46], Denmark (n=22) [71], the Netherlands (n=19) [71], Poland (n=17) [68], the UK (n=14) [50], Germany (n=12) [71], Belgium (n=9) [71], Estonia (n=5) [38], Ireland (n=2) [71], and Switzerland (n=1) [71].

Note that the number of reported Dutch, Belgian, and Danish patients and the prevalence of PMM2-CDG in those countries were estimated in 1998, and since then, the actual prevalence has not been updated, or only insufficiently.

The prevalence of CDG in Asia is certainly underestimated due to a lack of data. Japan, Saudi Arabia, and Turkey have the most reported CDG patients, with 214, 68, and 54, respectively [33, 42, 55, 60, 69, 121, 122, 126, 151, 166]. The most reported CDG in Turkey are POMT1-CDG (n=18) [49] and PMM2-CDG (n=11) [42]. The expected carrier frequency of PMM2-CDG in the Saudi Arabian and Turkish populations was lower than in European populations. We did not find reported PMM2-CDG patients from the following countries: Bahrain, India, Iran, Iraq, Israel, Kuwait, Lebanon, Pakistan, Palestine, South Korea, UAE, and Vietnam. Additionally, no CDG had a particularly high incidence in these countries. Some Middle Eastern and West Asian countries such as Iran, Iraq, Pakistan, Saudi Arabia, and Turkey have a high rate of consanguineous marriages [33, 42, 144, 145, 156, 160, 165, 166], increasing the rate of autosomal recessive diseases. Indeed, excluding Japan, these countries reported the highest number of CDG patients in Asia [49, 90, 99, 100, 102, 110, 112, 120, 121, 123, 127, 132, 133, 139, 144, 156, 158,159,160, 166, 167, 189, 192, 196, 197, 201,202,203,204,205].

The South American country with the highest number of reported CDG patients is Argentina [48, 97, 98]. There may be underdiagnosis in other South American countries, but the high influx of European immigrants (and thus, genetic import) in the last century could also have contributed to this observation in Argentina. This also explains why the R141H PMM2 variant was prevalent amongst these patients [97]. Furthermore, the only Latin American centre specialising in CDG is found in Argentina (https://cemeco.fcm.unc.edu.ar/), which implies a bias towards the diagnosis of Argentinian patients.

ALG6-CDG is considered the second most common CDG [87], but in our review, it fell behind EXT1/EXT2-CDG [48, 58] and FKTN-CDG [55]. This may well reflect reality, as EXT1/EXT2-CDG is often not counted as a CDG because it is monosymptomatic, whilst most CDG are polysymptomatic. The incidence of FKTN-CDG in Japan is 0.7–1.2 per 10,000 births [60]. This high incidence is caused by a founder FKTN pathogenic variant. Traditionally, FKTN-CDG is classified as a congenital muscular dystrophy (CMD). However, when counted as a CDG, fukutin deficiency surpasses ALG6-CDG as the second most common CDG.

In 2021, Pajusalu et al. [206] performed a statistical analysis to estimate the prevalence of 27 N-linked CDG across different populations. The estimation was based on allele frequencies disclosed in gnomAD (https://gnomad.broadinstitute.org/) and ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/). The expected prevalence differed from the one available in previously published literature. Regarding PMM2-CDG, the group reported that the Finnish population has the highest prevalence (1:18,745), followed by Ashkenazi-Jews (1:19,908) and non-Finnish European populations (1:27,465). ALG6-CDG was more prevalent amongst non-Finnish Europeans (1:623,512). Amongst East Asian populations, MAN2B2-CDG and FUK-CDG have a high prevalence (1:11,323 and 1:12,248, respectively). ALG1-CDG is more prevalent amongst Ashkenazi-Jews (1:47,656), compared to other populations. The other N-linked CDG had a considerably lower prevalence amongst the different populations.

Concerning the diagnosis and screening of CDG, the most used method for screening congenital disorders of N-linked glycosylation is TfIEF. Since transferrin carries sialylated N-glycans, there is a cathodal shift caused by a partial deficiency of sialic acid in CDG. Normal serum transferrin is mainly constituted of tetrasialotransferrin with negligibly small amounts of mono-, di-, tri-, penta-, and hexasialotransferrins. Abnormal TfIEF results, caused by the cathodal shift, can be divided into types 1 and 2: in type 1, there is an increase of both disialotransferrin and asialotransferrin and a decrease of tetrasialotransferrin, whilst type 2 is also characterised by an increase of trisialotransferrin or monosialotransferrin [14, 207].

Although this technique is one of the most applied whilst screening for CDG, a normal TfIEF result does not exclude CDG, as congenital disorders of O-glycosylation, (GPI)-anchor synthesis, and other glycosylation pathways are overlooked. In addition, other inborn errors of metabolism, like hereditary fructose intolerance (HFI) [208] and classic galactosemia [209], have also been associated with abnormal TfIEF results.

The identification of CDG and other RDs through genome sequencing aids in understanding the underlying pathophysiologies, which in turn helps to target therapies [210].

Next-generation sequencing (NGS) methods like WGS and WES applied in large population studies are an excellent tool for estimating a more realistic prevalence of CDG and other RDs. These methods remove the necessity to prioritise candidate genes for sequencing. Genome-wide association studies using data from thousands of individuals, biobanks, and population samples help to identify genetic variations associated with a particular disease.

Currently, WES is the major tool for CDG diagnosis. It is cost-effective, as it has a higher yield of relevant gene variants compared to WGS: although the human exome represents less than 2% of the genome, approximately 85% of known disease-related pathogenic variants occur in exons [211].

Limitations and strengths

The primary limitation faced whilst conducting our review was the heterogeneity of the data, which stemmed from (1) CDG multiplicity and heterogeneity; (2) the differences between studies (study population(s), information sources, study design, etc.) and the duplication of patients across studies; and (3) a lack of uniform, accepted standards for the collection and organisation of CDG epidemiological data. Additionally, we did not identify studies from Africa, South-Eastern Asia, or Russia. That gap is also reflected in the fact that we only identified epidemiological data on 93 of the approximately 170 CDG types currently reported. Importantly, the number of new CDG has grown rapidly in recent years, from 128 in 2017 to approximately 170 in 2021 [12], hindering the tracking of the real number of CDG patients and the actual prevalence of the diseases.

Inconsistencies in published data were another important limitation. The inconsistent number of reported CDG patients in scientific publications and reports [89] underlines a low level of consistency between publications and registries, health institutes, and agencies. This leads to a lack of clarity regarding the real numbers, impacting both data quality and quantity. In an attempt to centralise information and make it broadly available, we will create a virtual toolkit on the worldcdg.org platform, consisting of a web section dedicated to epidemiology, an ePoster about CDG epidemiology, and the recording of a talk on the subject presented at the 5th World CDG Conference.

Moreover, our search method was a limitation of our work, as it failed to retrieve a high number of case reports. Indeed, most of the papers additionally included through authors’ referrals were case reports. Hence, the search method should be refined in future studies, namely by improving the keywords. Still, our search methodology not only allowed us to identify 3057 CDG patients, but also brought methodological advances, as it promoted the automation of the search process. A literature search using a customised Python script is more time efficient since it retrieves and analyses information faster.
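To illustrate what keyword refinement might look like in an automated search, the following minimal sketch screens a list of already retrieved records. It is not the script used in this review. The record format, the keyword list, and the regular expression (an upper-case gene symbol followed by the "-CDG" suffix, as in the official nomenclature) are assumptions chosen for this example.

```python
import re

# Hypothetical pattern: a gene symbol followed by the "-CDG" suffix, e.g. "PMM2-CDG".
CDG_NAME = re.compile(r"\b[A-Z][A-Z0-9]{1,9}-CDG\b")

# Hypothetical free-text keywords; a real search would need a curated list.
KEYWORDS = (
    "congenital disorder of glycosylation",
    "congenital disorders of glycosylation",
    "carbohydrate-deficient glycoprotein",
)


def screen_records(records):
    """Keep records whose title or abstract mentions CDG; skip repeated titles."""
    seen_titles = set()
    kept = []
    for rec in records:
        title = rec.get("title", "")
        key = title.strip().lower()
        if key in seen_titles:
            continue  # same title already kept once
        text = f"{title} {rec.get('abstract', '')}"
        if CDG_NAME.search(text) or any(k in text.lower() for k in KEYWORDS):
            seen_titles.add(key)
            kept.append(rec)
    return kept


records = [
    {"title": "A case report of PMM2-CDG", "abstract": ""},
    {"title": "A case report of PMM2-CDG", "abstract": "duplicate entry"},
    {"title": "Diabetes cohort", "abstract": "no relevant terms"},
    {"title": "Novel phenotype", "abstract": "A congenital disorder of glycosylation was diagnosed."},
]
kept = screen_records(records)
print([r["title"] for r in kept])
```

Matching on the "-CDG" suffix is one way to catch case reports that name a specific CDG without using the umbrella term, which is the kind of record the paragraph above describes as missed.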

This study relied solely on data from the medical literature. No complementary data sources (such as patient registries and other databases) were used, which limited our analysis. However, the lack of centralised and open CDG data repositories also hindered the possibility of accessing and incorporating such information in this study.

Our literature review provides an overview of what is known regarding the epidemiology of CDG. It highlights the urgency of collecting and grouping epidemiological data with standardised parameters, to produce a more realistic estimate of the number of CDG patients.

The social and economic burden of CDG and other RDs is often overlooked because of the limited evidence on the prevalence of these diseases. Advances in the epidemiological study of CDG and other RDs help to guide healthcare decision-makers in prioritising healthcare policies and clinical management guidelines related to them.

Our review raises awareness of the need to create high-quality patient registries and databases to support the targeting of patient populations for clinical trials and the development of orphan drugs.

Conclusion

This is the first literature review compiling published data on the global epidemiology of CDG. This study identified several challenges and gaps regarding CDG epidemiology, with data scarcity, inconsistency, and CDG heterogeneity amongst them. It also flagged a remarkable lack of uniform, accepted standards for the collection and organisation of CDG epidemiological data. Higher-quality epidemiological data and more realistic estimates of the actual number of people living with CDG are important in order to target resources for CDG research and drug development, and to support public health decision-making. To achieve this, future work should explore this issue further by assembling epidemiological data from databases (ClinVar, LOVD, etc.), patient registries, medical records, biobanks, and international collaborations with patient and professional networks.