Introduction

In December 2019, a pneumonia outbreak occurred in Wuhan, China. The disease, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), spread rapidly worldwide, and the WHO declared COVID-19 a pandemic on March 11, 2020 [1, 2]. Globally, as of 2 May 2022, there had been 511,479,320 confirmed cases of COVID-19, including 6,238,832 deaths (data from WHO, https://covid19.who.int/).

Although the early transmission dynamics of SARS-CoV-2 remain unclear, accumulating evidence has confirmed that the virus can be transmitted between humans and that many subclinical cases exist after close contact. As with most viral infections, the severity of symptoms varies widely among patients with SARS-CoV-2 infection [3]. Most patients report mild to severe respiratory symptoms. Infected adults usually present with fever, cough, dyspnea, and pneumonia. Elderly patients with underlying disease or immunocompromised conditions are prone to severe complications such as acute respiratory distress syndrome [4]. However, some patients who test positive by RT-PCR are either asymptomatic or minimally symptomatic [5]. In addition, epidemiological observation of patients’ close contacts shows that not all individuals exposed to SARS-CoV-2 become infected. The sources of this variation are undoubtedly multifactorial, and one possibility lies in genetic variants and variable gene expression in host cells.

Genetic variants play a role in susceptibility to infectious diseases, and the global genetics community has been actively investigating the genetic contribution to COVID-19 [6, 7]. Recent research has shown that genetics also plays a role in the severity of COVID-19 [8, 9]. Ellinghaus et al. identified a 3p21.31 gene cluster as a genetic susceptibility locus in COVID-19 patients with respiratory failure and confirmed the potential involvement of the ABO blood-group system [10]. A genetic variant in the factor 3 (F3) gene has been identified as a risk factor in patients with severe COVID-19 [11]. Another variant (rs150892504) was found in the ERAP2 gene, which encodes a zinc metalloaminopeptidase [12] that is important for the presentation of antigens via HLA class I–binding peptides, thereby activating T cells to respond to and kill infected cells [13]. However, whether genetic variants influence the outcome of SARS-CoV-2 infection (asymptomatic or symptomatic) in the Chinese population remains unknown.

In this study, to identify new susceptibility genes that might predispose individuals to COVID-19, we conducted a whole-exome sequencing (WES)-based genome-wide association study (GWAS) of COVID-19 in a Han Chinese population and identified new COVID-19 susceptibility loci that are involved in the immune response and associated with increased risk of SARS-CoV-2 infection.

Materials and methods

Study subjects

From February 1 to May 30, 2020, a total of 256 Han Chinese subjects were consecutively recruited at three hospitals in Chongqing (Chongqing Three Gorges Central Hospital, University-Town Hospital of Chongqing Medical University, and Yongchuan Hospital Affiliated to Chongqing Medical University). Among these individuals, 171 were confirmed cases of COVID-19 (tested positive for SARS-CoV-2) and 85 were close contacts of confirmed patients (tested negative for SARS-CoV-2). Written informed consent was obtained from all individuals involved in the study. The study was approved by the Ethics in Research Committee of Chongqing Medical University in Chongqing, China.

Laboratory confirmation

To identify SARS-CoV-2 infection, nasopharyngeal swabs were collected at least twice and tested by RT-PCR. RNA was isolated from all samples within 24 h. Viral RNA was extracted according to the manufacturer’s instructions using the Nucleotide Acid Extraction Kit (DAAN Gene, Registration No. 20170583), which is based on an automated magnetic bead purification procedure. A commercial RT-PCR kit (DAAN Gene, Registration No. 20203400063) was used to test samples for SARS-CoV-2. Briefly, two target genes, open reading frame 1ab (ORF1ab) and nucleocapsid protein (N), were simultaneously amplified and tested by RT-PCR. The primers used for SARS-CoV-2 RT-PCR testing followed the recommendations of the Chinese Center for Disease Control and Prevention. The PCR cycling conditions were 50 °C for 15 min and 95 °C for 15 min, followed by 45 cycles of 94 °C for 15 s and 55 °C for 45 s (fluorescence collection) [14].

Ct values less than 37 and greater than 40 were defined as positive and negative, respectively, for both genes. Samples with Ct values from 37 to 40 were defined as inconclusive, and a second test was needed. Starting 1 week after admission, nasopharyngeal samples were tested by RT-PCR every 2–3 days for the remainder of the hospitalization period. Patients with one positive RT-PCR result were defined as patients with SARS-CoV-2 infection. Patients with two consecutive negative RT-PCR results were defined as SARS-CoV-2 negative [14].
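The Ct decision rule described above can be sketched in code. This is only a minimal illustration, not the software supplied with the kit: the names `CtCall`, `classify_ct`, and `classify_sample` are hypothetical, and the rule for combining the two targets (positive only if both ORF1ab and N are positive, negative only if both are negative, inconclusive otherwise) is an assumption, since the text states only that the thresholds apply to both genes.

```python
from enum import Enum


class CtCall(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    INCONCLUSIVE = "inconclusive"


def classify_ct(ct: float) -> CtCall:
    """Classify one target gene by its Ct value (thresholds from the text)."""
    if ct < 37:
        return CtCall.POSITIVE
    if ct > 40:
        return CtCall.NEGATIVE
    return CtCall.INCONCLUSIVE  # 37 <= Ct <= 40: a second test is needed


def classify_sample(ct_orf1ab: float, ct_n: float) -> CtCall:
    """Combine the ORF1ab and N calls (the combination rule is an assumption)."""
    calls = {classify_ct(ct_orf1ab), classify_ct(ct_n)}
    if calls == {CtCall.POSITIVE}:
        return CtCall.POSITIVE
    if calls == {CtCall.NEGATIVE}:
        return CtCall.NEGATIVE
    return CtCall.INCONCLUSIVE
```

Under these assumptions, a sample with one positive and one negative target would be flagged for retesting rather than called either way.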

Definitions

A confirmed case of COVID-19 was defined as an individual with consecutive positive nucleic acid tests for SARS-CoV-2 (twice in every 24 h), using laboratory-based RT-PCR.

Symptomatic patients were defined as patients with laboratory-confirmed COVID-19 who presented with symptoms such as fever, cough, sore throat, and sputum production.

An asymptomatic case was defined as an individual with a positive nucleic acid test result but without any relevant clinical symptoms in the preceding 14 days and during hospitalization.

A close contact was defined as anyone who had direct contact with infectious secretions of a COVID-19 patient. Close contact can occur while caring for, living with, visiting, or sharing a healthcare waiting area or room with patients with COVID-19.

The duration of shedding was calculated as the number of days from the first positive nasopharyngeal sample to the last positive sample based on RT-PCR testing. The last positive sample was followed by a negative RT-PCR result on two sequential tests [14].
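As an illustration of the shedding-duration definition, the sketch below computes the interval from a list of dated RT-PCR results. The function name `shedding_duration_days` and the input layout are hypothetical, and two details are assumptions rather than statements from the text: the duration is taken as the plain difference in calendar days between the first and last positive samples, and the function returns `None` when the last positive has not yet been followed by two negative results.

```python
from datetime import date
from typing import Optional, Sequence, Tuple


def shedding_duration_days(
    results: Sequence[Tuple[date, bool]],
) -> Optional[int]:
    """Days from first to last positive sample, or None if not yet resolved.

    `results` holds (sampling date, is_positive) pairs in any order.
    """
    ordered = sorted(results, key=lambda r: r[0])
    positive_dates = [d for d, is_pos in ordered if is_pos]
    if not positive_dates:
        return None  # never positive: no shedding interval to report
    first, last = positive_dates[0], positive_dates[-1]
    # Every result dated after the last positive is negative by construction;
    # require at least two of them, as in the definition above.
    negatives_after = [d for d, is_pos in ordered if d > last and not is_pos]
    if len(negatives_after) < 2:
        return None
    return (last - first).days
```

For example, positives on 1, 5, and 9 February followed by negatives on 11 and 13 February would give a duration of 8 days under this convention.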

Genetic analysis

Human genomic DNA was extracted with the Magnetic Beads Genomic DNA Extraction Kit (Nanjing ZhongkeBio Medical Co., Ltd.) according to the manufacturer’s instructions, and the concentration was measured with a NanoDrop 2000c spectrophotometer (Thermo Scientific, DE). A total of 0.3 μg DNA per sample was required for library generation. The manufacturer’s standard instructions were followed for the SureSelectXT Target Enrichment System for Illumina Multiplexed Sequencing (G7530-90000, Agilent), target capture with the Agilent SureSelect Human All Exon V6, and 150 bp paired-end sequencing on the Illumina NovaSeq platform, followed by bioinformatic processing and variant annotation. Briefly, each genomic DNA sample was fragmented by sonication to a size of 350 bp. The DNA fragments were then end-polished, A-tailed, and ligated with the full-length adapter for Illumina sequencing, followed by further PCR amplification. After the PCR products were purified (AMPure XP system), libraries were analyzed for size distribution on an Agilent 2100 Bioanalyzer and quantified by real-time PCR (3 nM). Finally, the DNA libraries were sequenced on the Illumina platform to generate 150 bp paired-end reads.

All WES data from individuals in the initial discovery cohort were analyzed with a standardized GATK4 pipeline. Raw reads were aligned to the hg19 human reference genome with the Burrows-Wheeler Aligner (BWA, v0.7.17) MEM algorithm. Duplicates were marked with the Picard MarkDuplicates tool (v2.18.0). Base Quality Score Recalibration (BQSR) was performed with GATK (v4.1.2.0) before SNP and indel calling with HaplotypeCaller, using interval lists specific to the exome enrichment kit for each sample. The gVCF files were combined with CombineGVCFs, and joint genotyping was then performed with the GenotypeGVCFs tool implemented in GATK. The cohort call set was filtered with GATK VariantRecalibrator and ApplyVQSR. The VCF file was annotated with a customized version of ANNOVAR. Parameters were based on the GATK Best Practices.

Statistical analysis

Statistical analysis was performed with SPSS (version 22.0, SPSS Inc., Chicago, IL, USA). Descriptive statistics (mean, standard deviation, and percentage) were used to summarize the background characteristics of the participants. The χ2 test or Fisher’s exact test was used to analyze categorical variables. A p value of less than 0.05 was considered significant. PLINK 1.9 was used for the GWAS analysis, which was performed on the VQSR-passed germline SNPs. For quality control, SNPs with a missing rate over 90%, a minor allele frequency (MAF) less than 1%, or a Hardy-Weinberg equilibrium (HWE) p value less than 0.000001 were removed. Samples with a pairwise identity-by-state distance (DST) over 0.85 were considered closely related and removed. Chi-squared association statistics were then calculated for each SNP with default parameters. Manhattan and QQ plots were drawn with the R package qqman.
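To make the per-SNP filtering and testing step concrete, the following sketch applies the quality-control thresholds quoted above and computes a 1-df allelic chi-squared statistic from a 2 × 2 table of allele counts. It is an illustration only, not a reimplementation of PLINK: the function names are hypothetical, the thresholds are copied as stated in the text, and the assumption that the default association test is the basic allelic test is ours.

```python
import math
from typing import Tuple

# QC thresholds as stated in the text.
MAX_MISSING_RATE = 0.90
MIN_MAF = 0.01
MIN_HWE_P = 1e-6


def passes_snp_qc(missing_rate: float, maf: float, hwe_p: float) -> bool:
    """Return True if a SNP survives the three per-SNP filters."""
    return (
        missing_rate <= MAX_MISSING_RATE
        and maf >= MIN_MAF
        and hwe_p >= MIN_HWE_P
    )


def allelic_chi2(
    case_a1: int, case_a2: int, ctrl_a1: int, ctrl_a2: int
) -> Tuple[float, float]:
    """Pearson chi-squared (1 df, no continuity correction) on allele counts.

    Returns (chi2, p_value). For 1 df the upper tail probability equals
    erfc(sqrt(chi2 / 2)), so no external library is needed.
    """
    n = case_a1 + case_a2 + ctrl_a1 + ctrl_a2
    row_case, row_ctrl = case_a1 + case_a2, ctrl_a1 + ctrl_a2
    col_a1, col_a2 = case_a1 + ctrl_a1, case_a2 + ctrl_a2
    denom = row_case * row_ctrl * col_a1 * col_a2
    if denom == 0:
        return 0.0, 1.0  # an empty margin carries no association signal
    chi2 = n * (case_a1 * ctrl_a2 - case_a2 * ctrl_a1) ** 2 / denom
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))
```

Each individual contributes two alleles, so a comparison of 87 cases with 85 controls corresponds to 174 and 170 allele counts per row when no genotypes are missing.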

Results

Demographic characteristics

The demographic characteristics are described in Table 1. The study included 87 symptomatic patients (G1), 84 asymptomatic cases (G2), and 85 close contacts of confirmed patients (G3), with mean ages of 49.45 years (SD 19.31), 43.55 years (SD 17.03), and 49.76 years (SD 19.08), respectively. No statistically significant difference in age or sex was found between groups.

Table 1 Characteristics of symptomatic and asymptomatic patients and close contacts

Genome-wide association analysis

Symptomatic patients (G1) vs. close contacts (G3)

We performed a GWA scan in 87 symptomatic patients and 85 close contacts using data generated with the SureSelectXT Target Enrichment System for Illumina Multiplexed Sequencing. After standard quality control (QC) filtering, 194,883 SNPs were included in the follow-up analyses. There was little evidence of inflation of the test statistic (λ1000 = 1.015). We identified eleven risk associations at nine different loci reaching genome-wide significance (effect allele frequency, EAF ≥ 0.05; p value < 1 × 10−5) (Table 2; Fig. 1a). There was no evidence of heterogeneity for these associations. rs34151785 was the most significant SNP in this GWAS (p value = 5.71 × 10−6).
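For readers unfamiliar with the inflation statistic, the sketch below shows one conventional way to compute the genomic inflation factor λ from 1-df chi-squared statistics and to rescale it to a standard study of 1000 cases and 1000 controls (λ1000). The text does not state how the reported value was obtained, so this is an assumption about the usual definitions, and the function names are hypothetical.

```python
import statistics
from typing import Sequence

# Median of the chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats: Sequence[float]) -> float:
    """Lambda: observed median test statistic over the expected median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def lambda_1000(lam: float, n_cases: int, n_controls: int) -> float:
    """Rescale lambda to an equivalent study of 1000 cases and 1000 controls."""
    scale = (1.0 / n_cases + 1.0 / n_controls) / (1.0 / 1000 + 1.0 / 1000)
    return 1.0 + (lam - 1.0) * scale
```

A value close to 1 indicates that the bulk of the test statistics follows the null distribution, which is the sense in which "little evidence of inflation" is used above.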

Table 2 Association evidence for 11 SNPs at 9 loci in GWAS (G1 vs. G3)
Fig. 1

Manhattan plots showing raw p values from the GWAS analyses. Each chromosome is depicted in a different color; the green horizontal line corresponds to the significance level of 1 × 10−4. a Summary of genome-wide association results for 87 symptomatic patients and 85 close contacts. b Summary of genome-wide association results for 87 symptomatic patients and 84 asymptomatic cases. c Summary of genome-wide association results for 84 asymptomatic cases and 85 close contacts

Symptomatic patients (G1) vs. asymptomatic cases (G2)

Figure 1b shows the Manhattan plot of the genome-wide association results for COVID-19 severity. The association between 194,805 genotyped SNPs and COVID-19 severity (87 symptomatic patients and 84 asymptomatic cases) was analyzed; the green horizontal line represents a p value of 1.0 × 10−4. As shown in Table 3, rs1033323, with a p value of 7.37 × 10−5 in the joint analysis, is located in chromosome region 1q42.2 within an intronic region of the PCNX2 gene.

Table 3 Association evidence for 6 SNPs at 6 loci in GWAS (G1 vs. G2)

Asymptomatic cases (G2) vs. close contacts (G3)

We also identified sixteen SNPs at nine loci that showed marginal evidence of risk associations (p value of 1 × 10−5 to 8 × 10−6) in the comparison of 84 asymptomatic cases and 85 close contacts (Table 4; Fig. 1c). The SNPs rs17032820, rs2289273, rs2289274, rs4684686, and rs2278554 are located in chromosome region 3p25.3 and were associated with SARS-CoV-2 infection in the joint analysis. These SNPs lie in a linkage disequilibrium (LD) block within the ATP2B2 gene.

Table 4 Association evidence for 16 SNPs at 9 loci in GWAS (G2 vs. G3)

Pairwise LD patterns in three SNPs of the ERAP1 gene and expression quantitative trait locus (eQTL) analysis

Pairwise LD among the three SNPs was computed with Haploview v4.2 (Fig. 2). The LD block formed by rs27042, rs469876, and rs26618 indicated that the three SNPs were in strong LD with one another (Dʹ > 0.90). Moreover, they were significantly associated with ERAP1 mRNA levels in eQTL analysis (Table 5).
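The two LD measures used in this study, Dʹ and r2, can be illustrated with a short calculation from allele and haplotype frequencies. This is a sketch under a simplifying assumption: the haplotype frequency is taken as known, whereas tools such as Haploview estimate it from unphased genotypes. The function name `ld_stats` and its arguments are hypothetical.

```python
from typing import Tuple


def ld_stats(p_a: float, p_b: float, p_ab: float) -> Tuple[float, float]:
    """Return (|D'|, r^2) for two biallelic loci.

    p_a, p_b: frequencies of allele A at locus 1 and allele B at locus 2.
    p_ab: frequency of the haplotype carrying both A and B.
    """
    variance_term = p_a * (1 - p_a) * p_b * (1 - p_b)
    if variance_term <= 0:
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient
    # D is normalized by its maximum attainable magnitude given the
    # allele frequencies; the bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max, d * d / variance_term
```

With allele frequencies of 0.6 and 0.2 and a haplotype frequency of 0.2, this gives Dʹ = 1 but r2 of roughly 0.17, which shows why a high Dʹ does not by itself imply that two SNPs are interchangeable as markers.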

Fig. 2

LD patterns (Dʹ plots) of the 3 SNPs in the ERAP1 gene, as generated by Haploview v4.2. The LD block is formed by rs27042, rs469876, and rs26618

Table 5 Functional annotation for SNPs in strong linkage disequilibrium with the marker SNP rs469876

Molecular analyses

SNPs may exert long-range effects on the expression of genes upstream and downstream of their loci. A search for genes located within 1 Mb of the SNPs (rs27042, rs469876) revealed several potentially interesting genes (Fig. 3a, b).

Fig. 3

Regional plots of the two SARS-CoV-2 infection susceptibility loci

Discussion

Individuals carrying specific variants of genes directly involved in viral infection (e.g., ACE2, TMPRSS2) or exhibiting differential expression of those genes may have inherently different susceptibility to SARS-CoV-2, which may explain the broad spectrum of symptoms and disease severity associated with COVID-19. We conducted a genetic association study of susceptibility to SARS-CoV-2 infection and of COVID-19 severity.

Recently, genetic variants in the ABO, LZTFL1, ABCD3, C4BPA, TMEM181, BRF2, ERAP2, ALOXE3, and IFNAR2 genes have been reported to be associated with SARS-CoV-2 infection or COVID-19 severity [10, 15,16,17,18,19,20,21,22]. In the genome-wide association analysis, our study showed a potential correlation between genetic variability in the POLR2A, ANKRD27, MAN1A2, and ERAP1 genes (Table 2) and disease susceptibility. We compared our results with previously identified candidate regions; except for ERAP1, these loci have not been reported in European or other populations. D’Amico et al. [23] reported that dysfunction of the ERAP1 and ERAP2 enzymes may exacerbate the effects of SARS-CoV-2 infection. Notably, we found that rs27042 and rs469876 were closely associated with susceptibility to COVID-19. SNPs rs27042 and rs469876 at chromosome 5q15 (linked with r2 = 0.83) in the ERAP1 gene were significantly associated with ERAP1 mRNA levels in eQTL analysis. ERAP1 is a member of the M1 zinc metalloprotease family; it contains a transmembrane domain and an active site with GAMEN and Zn-binding HEXXH(X)18E motifs. ERAP1 trims peptides in the generation of most HLA (human leukocyte antigen) class I–binding peptides and is also involved in regulating proinflammatory cytokine signaling through cleavage of cytokine cell surface receptors [24]. Studies have highlighted the role of ERAP1 in innate immune-mediated pathways involved in inflammatory responses and have indicated that SNPs in ERAP1 are associated with a number of autoimmune/inflammatory conditions [25,26,27].

We further explored the association between genetic variants and COVID-19 severity. A GWAS with symptomatic COVID-19 patients as cases and asymptomatic patients as controls suggested a potential correlation between genetic variability in the PCNX2, CD200R1L, ZMAT3, PLCL2, NEIL3, and LINC00700 genes (Table 3) and COVID-19 severity. None of these genes had previously been reported in the literature as associated with COVID-19. Among the genetic loci associated with severe COVID-19, the 3p21.31 gene cluster has been reported to be robustly associated with COVID-19 severity. CCR9 and CXCR6 have been identified as putative causal genes at the 3p21.31 locus [10]. We found little evidence that allele frequency differences at this locus could account for the higher rate of severe outcomes from COVID-19. Janie F. Shelton, Anjali J. Shastri, et al. also found no correlation between the 3p21.31 gene cluster and severe COVID-19 [28]. In fact, the primary risk allele at the chromosome 3p21.31 locus is most common in European populations [29]. It is well known that individuals of diverse racial and ethnic backgrounds harbor different allelic variants.

There are several limitations to our study. First, the sample size is relatively small; the power analysis indicates that a sample size of around 250 is barely sufficient to identify genome-wide significant genetic variants with a MAF greater than 0.2 and an odds ratio greater than 1.8 at a type I error rate of 0.05. Second, the patients with COVID-19 were divided only into asymptomatic and symptomatic groups and were not further subdivided by severity. Third, the extent to which individual genetic variants affect susceptibility to SARS-CoV-2 infection and COVID-19 severity has not yet been assessed.

Conclusion

This study examined host genetic variability associated with SARS-CoV-2 infection in a Han Chinese population using WES-based genome-wide association analysis. We discovered new and plausible genetic associations with susceptibility to SARS-CoV-2 infection and with COVID-19 severity. Importantly, our results provide preliminary insights that require functional validation in future studies.