Introduction

Systemic lupus erythematosus (SLE, OMIM 152700) is a clinically heterogeneous autoimmune disease with an estimated heritability of 0.66 similar to other autoimmune diseases (Selmi et al. 2012). In the past decade, genome-wide association studies (GWAS) have identified more than 100 risk loci that are robustly associated with SLE (Chen et al. 2017; Langefeld et al. 2017). The risk variants identified by GWAS are rarely located in protein-coding exons, instead most of them are common variants thought to affect regulatory genomic regions such as promoters and enhancers (Hindorff et al. 2009; Farh et al. 2015).

In addition, there exist several monogenic disorders with an SLE-like phenotype that are inherited in a Mendelian fashion and are caused by mutations in one out of 32 so far known genes (Tsokos et al. 2016). These genes have been identified by familial manifestation of SLE that is mainly shared between mother and daughter or between female sibling pairs in a family. In ten of these genes there are mutations that cause classical SLE where a patient fulfills the classification criteria for SLE (Tan et al. 1982). Another set of 12 genes carry mutations that cause dysregulation of genes in the type I interferon (IFN) system, which is a prominent feature shared by the majority of patients with SLE (Hagberg and Ronnblom 2015).

In aggregate, monogenic forms of SLE contribute only to a small fraction of all SLE cases. The most common form of monogenic SLE is caused by mutations in the TREX1 gene that have been identified in 0.5–2% of adult SLE patients (Lee-Kirsch et al. 2007; Namjou et al. 2011). The highest penetrance of an SLE-like disease has been observed for mutations in the complement system, with a particularly high penetrance for complement factor 1 and 4 deficiencies, while a lower penetrance has been observed for the more common complement factor 2 deficiency (Pickering et al. 2000). The variants in complement system genes represent less than 1% of all SLE cases combined. Highly penetrant monogenic diseases manifest when a protein-coding gene is affected by mutations in one or both alleles, depending on if the deleterious allele is recessive or dominant. A recessive disease-causing effect can be the result of a homozygous deleterious genetic variant or by compound heterozygosity in a protein-coding gene where different deleterious variants have been inherited from each parent. However, more subtle effects of heterozygous mutations have been observed for variants connected to Mendelian diseases (Sidransky 2006; Valente and Ferraris 2007) blurring the line between Mendelian and complex disorders.

To increase the power of finding associations for rare mutations in a case–control association setting, there are a number of tests that combine the effect of several variants within a region of interest into one test. Examples of these are burden tests (Morgenthaler and Thilly 2007; Han and Pan 2010) and variance component tests (Wu et al. 2011). An even broader approach is to test for enrichment of variants in selected features in a set of genes (Singh et al. 2017). A completely global approach is to use machine learning on all called variants to be able to separate healthy individuals from patients (Abraham and Inouye 2015). We have previously used this approach in SLE where we trained a random forest model using the variants from 1160 patients and 2711 controls genotyped on the ImmunoChip to obtain a SLE risk score (Almlof et al. 2017).

Using whole-genome sequencing (WGS) of parent-offspring trios, it is possible to find almost all single nucleotide variants (SNVs) and most smaller insertions–deletions (INDELs), while at the same time identifying the parent of origin for many of the variants. Whole exome sequencing (WES) of SLE family trios has identified de novo mutations and potential novel SLE genes (Pullabhatla et al. 2018). WES has also successfully identified rare variants that are likely pathogenic in SLE (Delgado-Vega et al. 2018) and WGS of monozygotic twins discordant for SLE has found CNVs that may be associated with difference in SLE phenotype between twins (Chen et al. 2018).

In this study, we performed whole-genome sequencing (WGS) of samples from 71 Swedish SLE trio families with two healthy parents and one child affected by SLE. We employed the trio study design to investigate rare risk variants for SLE located in functional elements in, and in the vicinity of, genes carrying variants that are known to cause monogenic disorders with an SLE-like phenotype. Using a combination of WGS trio data with the previously trained random forest, it was possible to investigate the parent of origin for called variants and elucidate possible differences in inheritance depending on sex and type of variants.

Results

Risk of SLE from common SNPs is mainly inherited from one parent

In an earlier study (Almlof et al. 2017), we developed a random forest (RF) model to determine a score that indicates the risk to develop SLE based on the genotype data from a Swedish SLE case–control association study using the ImmunoChip with approximately 120 k SNPs across 186 loci known to be associated with immune-mediated diseases (Illumina) (Cortes and Brown 2011). We here used the single nucleotide variant (SNV) calls from WGS of 71 trio families with the offspring affected by SLE that overlap with the SNVs included on the ImmunoChip (97.4% overlap) to determine the RF derived risk scores for SLE for the trio family members. We used the scores to compare the risk of SLE for the parents in the trio families with that of healthy Swedish controls (n = 2711) and to compare the risk scores for the patients with SLE in the trio families with the risk scores for the larger cohort of SLE patients, who were included in the ImmunoChip case–control study (n = 1160). According to the prediction by the RF model, the parents in the trio families had a higher average SLE disease score than the healthy controls (34% vs 27%), but a lower average disease score than the SLE patients in the trio families (34% vs 42%). The average risk of SLE for the parent with the higher risk of SLE in each family was of similar magnitude as that for the patients (42%), while the parent with lower risk of SLE displayed an equally low risk of SLE as the controls (26%). These risk predictions indicate that the complex genetic predisposition for SLE is mainly inherited to the patient from one of the parents in a family.

Support for the one-parent mode of inheritance is provided in (Fig. 1a) where the distribution of the risk scores for SLE between the members of the trio families show similarities between the parents with the higher SLE risk score and the SLE patients, while the distribution for the parents with a lower SLE risk score show similar distribution as the controls. Another way to illustrate this is through correlation of the risk score of the SLE patients and of the parents (Fig. 1b). There is a highly significant correlation coefficient of 0.47 (p value 2.12E − 11) between the risk scores for the parent with the higher risk of SLE in each family and those of the SLE patients in the trios. The correlation coefficient of 0.47 should be compared to that of the parents with a lower risk score of SLE, who had a correlation coefficient of only 0.15 with the SLE risk of the patients in the trio, where the correlation is mainly driven by a few high-risk samples. Notably, there was no difference in average risk scores between the mothers and the fathers.

Fig. 1
figure 1

Risk score for systemic lupus erythematosus (SLE) of parents and patients with SLE in the family trios. a Distribution of predicted random forest risk scores for SLE patients (n = 71), their parents and healthy controls. The two parents in each family are separated into higher and lower risk based on their respective random forest risk score. b Linear correlation between the random forest risk score for SLE of the patients and of the parent with higher SLE risk score in each family trio is shown in blue. The correlation between the SLE risk score for the SLE patient and the parent with lower risk of SLE in each family is shown in orange

Enrichment of ultra-rare missense variants in genes associated with monogenic SLE

Next, we investigated if the variants called in WGS data from our patients with SLE were enriched in promoter and protein-coding regions of SLE genes in comparison to the recently published Swedish genomes reference dataset [SweGen (Ameur et al. 2017)]. For the variants in protein-coding regions, we only considered non-silent variants. The enrichment analysis included variants in 22 genes that are reported to cause monogenic forms of classical SLE or dysregulation of the type I interferon system (Supplemental Table S1).

In the SLE patients from the trio families, we observed an enrichment (OR = 2.07, p value = 0.00182) of ultra-rare missense variants with minor allele frequency (MAF) ≤ 0.1% in protein-coding regions of genes known to cause monogenic forms of SLE (Fig. 2). The majority (20 out of 21) of these ultra-rare sequence variants was observed in the heterozygous form. The 21 ultra-rare sequence variants identified in 18 patients represent an excess of 10.9 variants compared to that expected by chance according to the enrichment analysis. Thus, approximately one-seventh of the SLE patients included in our analysis seem to carry rare risk variants with small to medium effect sizes in one of the genes causing monogenic SLE. For variants with higher MAF and variants in promoters, we did not observe any significant enrichment. Variants close to genes that have previously been associated to SLE in GWAS studies were also investigated in a similar fashion as the genes associated with monogenic SLE but no significant enrichment was found.

Fig. 2
figure 2

Enrichment analysis of missense and promoter variants. p values for enrichment of missense and promoter variants in genes causing monogenic forms of SLE are shown on the vertical axis at different minor allele frequencies as indicated on the horizontal axis. The red line shows the 0.05 significance threshold after multiple testing correction. Values below zero on the horizontal axis indicate depletion

Functional annotation of rare variants in genes causing monogenic SLE

The potential functional impact in SLE of each of the 21 rare SLE risk variants was assessed based on their functional annotations, effects or locations in the encoded proteins, DANN score, and predicted effect on the protein function by the SIFT or PolyPhen2 programs. In one of the patients, we found a previously reported homozygous nonsense mutation in the C1QC gene (Arg69*) (Schejbel et al. 2011). A non-functional C1q protein leads to lupus-like symptoms with 85% penetrance and to SLE that fulfills the American College of Rheumatology (ACR) criteria for classification of SLE (Tan et al. 1982) with 50% penetrance (van Schaarenburg et al. 2016). The patient with the homozygous nonsense mutation in the C1QC gene suffers from immunodeficiency and a severe SLE phenotype (Bolin K, Eloranta M-L, Kozyrev SV, Dahlqvist J, Nilsson B, Knight A, Rönnblom L, manuscript in preparation). In addition, we detected seven heterozygous missense or truncating mutations in seven patients located in five genes (C1S, DNASE1L3, DNASE1, IFIH1, and RNASEH2A) with high potential to contribute to SLE. The identified variants are described in detail in Table 1 and calling quality measures for the variants are listed in Supplemental Table S2, showing the high reliability of the variant calling. Two of the genes (DNASE1 and IFIH1) contain two unique mutations. Five of the variants are reported in dbSNP, all with low MAF in Europeans and at most 0.05% MAF in the SweGen reference dataset (Ameur et al. 2017). However, two of the variants found in DNASE1 have a markedly higher MAF in African populations. The last two variants are not found at all in the Swedish reference population or in dbSNP. Each of the variants was only found in one patient.

Table 1 Summary of missense and nonsense variants that are predicted to affect function identified in 71 trio families in genes carrying known variants causing monogenic SLE

Mode of inheritance of rare risk variants

To examine the mode of inheritance of the eight rare risk variants for SLE reported in Table 1, we investigated if there were any patterns that showed from which of the parents the variant was inherited or if it was randomly inherited. The SLE risk scores for the eight patients with the rare risk variant were not significantly different from the other patients in the study. However, the inheritance of the risk score was not randomly distributed. We found that there was a high correlation (R2 = 0.86) between the RF risk score of the parent lacking the SLE risk variant identified in Table 1 and the patient (Fig. 3a). On the other hand, no correlation was observed between the RF risk score of the parent having the SLE risk variant and the RF risk score of the patient (Fig. 3b). Thus, the genetic burden of SLE in the child is mostly inherited from one of the parents with the added burden from the other parent in the form of the rare risk variant identified here.

Fig. 3
figure 3

Linear correlation between the random forest risk score for SLE of the patients and the parents with and without any of the eight reported rare variants. a The orange line shows the high correlation between the RF risk score for the parent lacking the rare variant and the patient. b The blue line shows the absence of correlation between the parent carrying the rare variant and the SLE patient

Clinical characteristics of patients with heterozygous rare risk variants

By comparing the frequencies of SLE sub-phenotypes, as described by the 11 ACR criteria, between the seven patients with heterozygous rare risk variants with all patients in this study, we were able to distinguish if this sub-group presented a unique disease manifestation. Strikingly, none of the patients with heterozygous rare risk variants had nephritis compared to 38% in the entire cohort. However, the difference are only nominally significant (p = 0.022) before multiple testing correction for the 11 ACR criteria tested. None of the other ACR criteria show any trends between the patient groups.

Discussion

Rare genetic variants that have remained undetected due to limitations in statistical power are believed to be one of the causes of the “missing heritability” observed despite many large GWAS of complex diseases. Burden or aggregate association tests, in which all rare variants affecting the same gene are combined into one test, are used to increase the statistical power for rare variant association. Some recent studies have succeeded in identifying genes with rare variants with statistical significance, exemplified by RNASEH2 in SLE (Gunther et al. 2015), whilst rare variants in other genes have failed to be replicated, like SIAE in RA (Surolia et al. 2010; Hunt et al. 2011). Here, to further increase the statistical power, we simultaneously analyzed rare variants in multiple genes that have been shown to cause Mendelian forms of SLE. Using this approach, it is not possible to observe association between individual genes and SLE, instead we obtain a measure of the enrichment of disease-contributing rare variants in all tested genes. However, we are limited in power by the low number of samples studied. We will therefore only pick up the strongest signals and might miss weaker signals present in for example promoters, enhancers, or variants at different minor allele frequencies. In addition, reproducibility of the exact reported variants is problematic due to the rarity of the variants. On the other hand, the enrichment of rare variants in genes associated to monogenic SLE should be easier to confirm.

The enrichment of SNVs in the genes causing monogenic SLE was calculated by comparison with the reference genomes of a thousand healthy individuals that constitute the SweGen dataset (Ameur et al. 2017). The variant calling procedure differs between our study and the SweGen dataset as we utilize the trio information to improve the variant calling accuracy. This will have the greatest impact on private variants as they will gain support from at least one parent in our study. To minimize this effect, we normalized the enrichment based on the difference in the total number of variant calls between the two datasets in the relevant minor allele frequency range and annotated functional elements.

As shown in Fig. 1a, the distribution of the risk scores generated using a random forest model for SLE patients is bimodal. This could partly be a consequence of the low sample size. However, a less pronounced bimodal distribution of the risk scores remains when including all the 1160 genotyped patients to construct the random forest predictor, which suggests that a bimodal distribution is an accurate representation of the data, and that two distinct groups of patients with differing genetic risk for SLE exist within the SLE patient population studied here. However, there is no significant association between risk scores and any of the ACR criteria, sex, or age of onset. The apparent difference in SLE risk is instead probably mainly due to the fact that the ImmunoChip does not cover all the variations found in SLE. The ImmunoChip targets only approximately 120 k SNPs across 186 loci known to be associated with immune-mediated diseases and thus most of the rare variations will remain undetected.

In our study, we found that the RF risk scores of the parents without the rare risk variant had a high correlation with the RF risk score of the patient (Fig. 3a). The parents with the rare risk variant on the other hand showed no such correlation (Fig. 3b). This observation suggests that the risk variants with higher minor allele frequencies are inherited from one parent and that the additional genetic burden needed to trigger SLE in the child is inherited from the other parent in the form of a very rare risk variant affecting a gene known to cause monogenic SLE. To draw a parallel to cancer, it would constitute the second hit needed to develop the disease. These patients could also be viewed as a new subgroup of SLE patients with an intermediate genetic risk compared to the patients with monogenic SLE and those with high-frequency risk variants found by GWAS.

Most of the ultra-rare candidate risk variants for SLE identified in our study encode amino acids located close to functionally critical amino acid residues, but they may not be critical alone. For example, variants in C1S and DNASE1 are located close to active sites of these enzymes, variants in DNASE1 and DNASE1L3 affect the Ca2+ binding loop in the corresponding proteins, but are not involved in the actual binding, variants in IFIH1 and RNASEH2A are spatially close to known SLE-like disease-causing variants in the proteins. Such variants could affect the protein function, but it seems unlikely that they could cause complete inactivation of the protein, instead they might contribute to increased risk for SLE in a similar fashion as common risk variants identified by GWAS. Two of the genes (DNASE1 and IFIH1) carry two unique mutations providing extra functional support for these. In addition, the two rare variants in DNASE1 have markedly higher minor allele frequencies in African populations than in Europeans, which could possibly explain part of the 3–4 times higher prevalence of SLE in African populations (McCarty et al. 1995).

The random forest model calculates a SLE risk score which when compared with the risk of healthy individuals can be used to the probability to develop SLE. However, as the disease is rare, even a greatly increased risk would still equal a quite low probability to develop SLE in a single individual, implying that the random forest model in its present form would not be useful in a clinical setting.

Materials and methods

DNA samples

DNA was extracted from peripheral whole blood of 71 SLE patients and their biological parents attending the rheumatology clinics of the university hospitals in Uppsala, Stockholm (Karolinska University Hospital), Lund, and Linköping (Supplemental Table S3). All patients were examined by a rheumatologist and the medical records were reviewed. SLE patients and their parents provided informed consent to participate in the study, and the study was approved by the regional ethics committees. Of the patients 85% were female and averaged 24 years old at SLE onset. The patients fulfilled at least four American College of Rheumatology (ACR) 1982 criteria for SLE (Tan et al. 1982), with the exception of five patients who displayed three ACR criteria together with a clinical diagnosis of SLE, see further Supplemental Table S4. None of the parents had SLE at the time of sample collection and the average age of the parents was over 50 years of age.

Whole-genome sequencing and sequence alignment

Sequencing libraries were prepared from 1 µg of DNA using reagents from the TruSeq PCR-free DNA sample preparation kit (Illumina Inc.) targeting an insert size of 350 bp. 150 bp paired-end whole-genome sequencing was performed on an Illumina HiSeqX sequencer using v2.5 sequencing chemistry (Illumina Inc.). Whole-genome sequencing (WGS) was performed by the SNP&SEQ Technology Platform at Uppsala University, Sweden (http://www.sequencing.se). The sequences were aligned with BWA (Li and Durbin 2009) version 0.7.12 using default parameters and the b37 human reference from the GATK file bundle version 2.8. The reads in the raw alignments were then flagged for duplication and recalibrated using GATK version 3.3.0 (McKenna et al. 2010). The number of average aligned reads was 920 million per sample, which corresponds to an average genomic coverage of 40X. Statistics of the WGS after mapping and variant calling are shown in Supplemental Table S5.

Calling single nucleotide variants (SNVs)

Variants in the WGS data were called jointly in all samples using GATK version 3.5.0 following the GATK best practice protocol (Van der Auwera et al. 2013). In the variant recalibration step, we used positive training data from Hapmap (phred quality score prior likelihood of Q15 which is equal to 97% likelihood that the genotype is correct) and 1000 Genomes Omni 2.5M chip (prior Q12, 94% likelihood) as well as in-house genotype data from the same samples from the Infinium OmniExpressExome-8 v1.3 SNP chip (Illumina) with 958497 SNP markers (prior Q20, 99% likelihood). As additional training data, we used the 1000 Genomes high confidence calls (prior Q10, 90% likelihood) and for annotation and statistics the dbSNP version 138 (prior Q2, 37% likelihood). All data files except the in-house SNP genotype data were obtained from the GATK file bundle version 2.8. Variants were marked as PASS if the variant quality score log-odds (VQSLOD) were higher than the 99th percentile in the training data for SNVs. The variants were then further refined by calculating genotype posterior using the data from parent-offspring trios in GATK. Low quality variants were flagged if the genotype posterior had a score < Q20.

Gene enrichment analysis

Enrichment analysis was performed for 22 genes (supplemental Table S1) known to be involved in monogenic forms of SLE (Tsokos et al. 2016). The analyzed genes cause either monogenic SLE fulfilling four ACR criteria (10 genes) or a SLE-like disease by affecting the type I interferon pathway (12 genes). The odds ratios and enrichment were calculated in relation to the background frequencies in the SweGen reference dataset containing 1000 whole-genome sequenced Swedish individuals sequenced to similar depth and at the same sequencing facility as our data (Ameur et al. 2017). The enrichment analysis was performed for variants affecting the coding sequence and for variants in promoter. The data were then normalized based on the ratio for all variants in the relevant annotations and allele frequencies between the two studies.

Annotation of SNVs

The variants from all datasets were annotated using Annovar version 2016.05.11 (Wang et al. 2010). Chromatin state annotations of promoters were obtained from the ChromHMM (Ernst and Kellis 2012) predictions for the B-lymphocyte cell line GM12878. Relative gene positions were obtained from the RefSeq database (Pruitt et al. 2007). Minor allele frequencies in the Swedish population were retrieved from the SweGen database (Ameur et al. 2017) and from the European samples in the 1000 Genomes project (Genomes Project et al. 2015). Known SNVs were annotated using dbSNP release 138 (Sherry et al. 2001). The effect of nsSNVs on the encoded proteins was according to the predictions by SIFT (Kumar et al. 2009) and PolyPhen2 (Adzhubei et al. 2010). For identifying potential pathogenic variants, the DANN score (Quang et al. 2015) was used, where 1.0 is maximal pathogenic potential and 0.0 is minimal potential. The DANN score together with the Combined Annotation-Dependent Depletion (CADD) score have the best performance to discriminate germline pathogenic mutations according to recent benchmarks (Drubay et al. 2018).

Conclusion

We found that the higher minor allele frequency risk variants for SLE are mainly inherited to the patient from one of the parents in a trio family, while in some cases the second parent contributes with rare risk variants in genes causing monogenic forms of SLE. Based on enrichment analysis in functional elements, 11 of the 21 risk variants identified in our study should contribute to SLE, while we found evidence for eight of the identified variants to have an effect on the function of the encoded protein. Thus, rare variants in genes known to cause monogenic SLE could contribute to the risk of SLE in one out of nine patients which suggests a larger impact of rare variants in SLE than hitherto reported. In the absence of a replication cohort and functional validation of the rare variants reported here, future studies are needed to confirm these findings.