Background

Chronic obstructive pulmonary disease (COPD) is characterized by irreversible airflow obstruction and is strongly influenced by genetic factors [1, 2]. Genome-wide association studies (GWAS) of COPD and related traits (e.g., emphysema) have revealed multiple genetic loci associated with disease risk [3,4,5]. Most loci identified by GWAS are regulatory and do not directly alter the amino acid sequence of a protein.

Gene expression is arguably the most impactful and well-studied effect of regulatory genetic variation. GWAS loci are enriched for expression quantitative trait loci (eQTL), making gene expression a potential link between genetic variants and the biology of disease [6, 7]. Large cohort studies and consortia such as the Genotype-Tissue Expression (GTEx) Project have discovered thousands of genetic variants associated with gene expression in multiple tissues. While most GWAS do not concomitantly measure gene expression, the strong relationship between genetic variation and gene expression allows one to use reference gene expression datasets to predict gene expression from a set of genotypes, and subsequently to identify differences in predicted gene expression for a given phenotype. This approach has been implemented in software such as S-PrediXcan and TWAS [8,9,10]. Aggregating variant-level information to infer gene-level associations increases the power to discover genes at loci not previously implicated by GWAS and gives mechanistic insight into genes regulated by disease-associated genetic variants [7, 11].
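
The summary-statistics form of this idea can be sketched in a few lines. The example below is a minimal illustration, not the implementation used by the S-PrediXcan software: it assumes the commonly cited approximation in which the gene-level z-score is a weighted sum of GWAS z-scores, with each variant scaled by its standard deviation relative to that of predicted expression. The function name, argument names, and inputs are hypothetical.

```python
import math

def gene_level_z(weights, gwas_z, snp_cov):
    """Approximate gene-level z-score from GWAS summary statistics.

    weights : prediction weights of each variant for the gene (from a reference model)
    gwas_z  : GWAS z-scores (beta / standard error) for the same variants
    snp_cov : reference-panel covariance matrix of the variant genotypes
    All inputs are illustrative assumptions, not values from the study.
    """
    n = len(weights)
    # Variance of predicted expression: w' * Cov * w
    var_gene = sum(weights[i] * snp_cov[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    sd_gene = math.sqrt(var_gene)
    # Each variant contributes weight * (sd_variant / sd_gene) * z_variant
    return sum(weights[k] * math.sqrt(snp_cov[k][k]) / sd_gene * gwas_z[k]
               for k in range(n))
```

With a single variant, the gene-level z-score reduces to the GWAS z-score of that variant, with its sign set by the sign of the prediction weight.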

Despite the convention of naming a discovered locus for the nearest gene (e.g., HHIP), further study is needed to identify the specific gene(s) and variant(s) responsible for disease risk [9, 11, 12]. In identified COPD susceptibility loci, most loci contain multiple genes, and variants in these genes are correlated (in linkage disequilibrium). More than one gene in a locus may also play a role in disease pathogenesis, as seen in other complex diseases [13, 14]. With recently developed methods and a growing amount of gene expression data made publicly available, integrating GWAS with known functional annotations of each variant (e.g., associated with gene expression) could highlight novel and biologically relevant genes for further evaluation.

We hypothesized that applying these integrative methods to specific phenotypes of COPD (severe disease and quantitative emphysema) would facilitate discovery of new gene-disease associations and elucidate the mechanisms of genes in existing susceptibility loci. Specifically, we sought to identify genes and pathways genetically up- or down-regulated by phenotype-associated variants in tissue-specific reference datasets using S-PrediXcan and TWAS [3, 5], and to assess the potential causality of individual genes using multi-variant colocalization.

Methods

Genome-wide association studies and meta-analysis

We used genome-wide association summary statistics for two phenotypes based on the same four cohorts. Demographic characteristics of individuals included in analyses of these two phenotypes are summarized in Tables 1 and 2. The four cohorts were Genetic Epidemiology of COPD (COPDGene, NCT00608764); Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints (ECLIPSE, SCO104960, NCT00292552); National Emphysema Treatment Trial (NETT) and Normative Aging Study (NAS); and GenKOLS (Genetics of COPD, Norway). Meta-analyses of these two phenotypes were published previously [3, 5]. Severe COPD was defined by post-spirometric measures of forced expiratory volume in the first second (FEV1) lower than 50% of the predicted value and a ratio of FEV1 to forced vital capacity (FEV1/FVC) less than 0.7, excluding individuals with known severe alpha-1 antitrypsin deficiency. For quantitative emphysema, we produced the density histogram of segmented chest CT images and quantified emphysema using the percentage of low attenuation area at a threshold of −950 Hounsfield units (HU) (%LAA-950) and the HU at the 15th percentile of the density histogram (Perc15). A summary of our approach is shown in Fig. 1.
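
The phenotype definitions above can be illustrated with a short sketch. It assumes that the segmented lung is available as a flat list of voxel attenuation values in HU, that "low attenuation" means strictly below the threshold, and that the percentile uses the nearest-rank definition; the study's imaging software may differ on each point, and all function names are hypothetical.

```python
import math

def percent_laa(hu_values, threshold=-950):
    """Percentage of lung voxels with attenuation below `threshold` HU."""
    below = sum(1 for hu in hu_values if hu < threshold)
    return 100.0 * below / len(hu_values)

def perc15(hu_values):
    """HU value at the 15th percentile of the density histogram (nearest rank)."""
    ordered = sorted(hu_values)
    rank = max(1, math.ceil(0.15 * len(ordered)))
    return ordered[rank - 1]

def is_severe_copd(fev1_percent_predicted, fev1_fvc_ratio):
    """Severe COPD as defined in the text: FEV1 < 50% predicted and FEV1/FVC < 0.7."""
    return fev1_percent_predicted < 50 and fev1_fvc_ratio < 0.7

# Toy example with ten hypothetical voxels
voxels = [-1000, -980, -960, -940, -900, -880, -860, -840, -820, -800]
print(percent_laa(voxels))  # 30.0
print(perc15(voxels))       # -980
```

More emphysema corresponds to a higher %LAA-950 but a lower (more negative) Perc15, so the two measures are expected to have opposite directions of effect.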

Table 1 Demographic characteristics of individuals in the analysis of severe COPD
Table 2 Demographic characteristics of individuals in the analysis of quantitative emphysema
Fig. 1 Summary of analyses. First, we discovered transcriptome-disease associations (predicted gene expression-disease) using reference data from DGN-Blood and GTEx-Lung. Then, we validated these associations using another set of reference data (GTEx-Blood and Lung-eQTL Consortium). Finally, we confirmed the transcriptome-disease associations using colocalization analysis. COPD = chronic obstructive pulmonary disease; DGN = Depression Genes and Networks; GTEx = Genotype-Tissue Expression project; Perc15 = HU at the 15th percentile of the density histogram; severe COPD is defined as FEV1 < 50% predicted and FEV1/FVC < 0.7

Integration of GWAS and gene expression

To integrate our GWAS and gene expression results, we used S-PrediXcan [10]. We included two relevant reference transcriptome databases in our analysis: whole blood from Depression Genes and Networks (DGN-Blood) and lung tissue from the Genotype-Tissue Expression consortium (GTEx-Lung). Details on the prediction models and datasets used are provided in Additional file 1: Supplementary Methods. The ability of genetic variants to predict the expression of individual genes varies; only genes with significant prediction models were included in the analysis (11,529 genes for DGN-Blood and 6425 genes for GTEx-Lung). We accounted for multiple hypothesis testing using Bonferroni correction to determine the statistical significance of gene-disease associations, resulting in P-value thresholds of 4.34 × 10⁻⁶ and 7.78 × 10⁻⁶ for DGN-Blood and GTEx-Lung, respectively.
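
The Bonferroni thresholds quoted above can be reproduced from the number of genes tested in each reference panel. The sketch below assumes a family-wise error rate of 0.05, which is not stated explicitly in the text but is consistent with the reported values; the function name is hypothetical.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

# Number of genes with significant prediction models, as reported in the text
genes_tested = {"DGN-Blood": 11529, "GTEx-Lung": 6425}

for panel, n_genes in genes_tested.items():
    print(f"{panel}: {bonferroni_threshold(n_genes):.2e}")
# DGN-Blood: 4.34e-06
# GTEx-Lung: 7.78e-06
```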

Validation in other reference transcriptome databases

To determine whether our imputed gene expression results were consistent in other datasets, we tested significant genes from DGN-Blood and GTEx-Lung in two independent reference transcriptome databases, GTEx for whole blood (GTEx-Blood) and the Lung-eQTL Consortium for lung tissue, using S-PrediXcan and TWAS/FUSION (Additional file 1: Supplementary Methods). We considered an expression result to be validated if the direction of effect was consistent and the Bonferroni-corrected P-value was < 0.05.
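
The validation rule can be written as a small predicate. This is a sketch under one assumption: the Bonferroni correction multiplies the validation P-value by the number of genes carried forward for validation, a number the text does not state; the function and parameter names are hypothetical.

```python
def is_validated(z_discovery, z_validation, p_validation, n_genes_tested, alpha=0.05):
    """Return True if a gene meets both validation criteria.

    z_discovery, z_validation : association z-scores in the two reference panels
    p_validation              : uncorrected P-value in the validation panel
    n_genes_tested            : number of genes tested in validation (assumed input)
    """
    # Criterion 1: same direction of effect in discovery and validation
    same_direction = z_discovery * z_validation > 0
    # Criterion 2: Bonferroni-corrected P-value below alpha
    p_corrected = min(1.0, p_validation * n_genes_tested)
    return same_direction and p_corrected < alpha
```

A gene with a strong but directionally opposite association in the validation panel fails the first criterion regardless of its P-value.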

Colocalization analysis using eCAVIAR

Colocalization analysis estimates a posterior probability that a given variant or set of variants is causal for both the phenotype of interest (e.g., COPD) and the expression level of a given gene. We used eCAVIAR (eQTL and GWAS Causal Variant Identification in Associated Regions), as it allows for multiple causal variants [15]. Details on the parameters and procedures used in the analysis are provided in Additional file 1: Supplementary Methods. Genes identified in whole blood were tested for colocalization using eQTL from GTEx-Blood, and genes identified in lung tissue were tested using eQTL from GTEx-Lung and the Lung-eQTL Consortium. The probability that a variant is causal for a given gene in both datasets was determined by the colocalization posterior probability (CLPP), which approximates the posterior probability that the variant is causal in both the GWAS and the eQTL study [15]. We also obtained functional annotations of colocalized variants in lung-relevant cell types (Additional file 1: Supplementary Methods).
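
To make the CLPP concrete, the sketch below computes it per variant as the product of the two fine-mapping posteriors, which is how the quantity is usually described for eCAVIAR when the two studies are treated as independent. It assumes the per-variant posteriors have already been estimated (eCAVIAR derives them from summary statistics and LD, a step not shown here); the variant identifiers and values are hypothetical.

```python
def clpp(posterior_gwas, posterior_eqtl):
    """Colocalization posterior probability for variants present in both studies.

    posterior_gwas, posterior_eqtl : dicts mapping variant ID to the posterior
    probability that the variant is causal in the GWAS / eQTL study.
    """
    shared = posterior_gwas.keys() & posterior_eqtl.keys()
    return {snp: posterior_gwas[snp] * posterior_eqtl[snp] for snp in shared}

# Hypothetical posteriors for two variants at one locus
gwas = {"variant_A": 0.5, "variant_B": 0.1}
eqtl = {"variant_A": 0.4, "variant_B": 0.2}
scores = clpp(gwas, eqtl)
best = max(scores, key=scores.get)  # variant with the highest CLPP
```

Because the CLPP is a product of two probabilities, it is high only when the same variant is well supported in both the GWAS and the eQTL study.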

Results

Severe COPD

We first examined the association between severe COPD and imputed gene expression. Significant associations based on gene-based Bonferroni corrections for DGN-Blood and GTEx-Lung are shown in Table 3 and Fig. 2.

Table 3 Result of association analysis between imputed gene expression and severe COPD and emphysema (%LAA-950 and Perc15) with validation
Fig. 2 Manhattan plots of associations of imputed gene expression and phenotypes (severe COPD in the upper panel; %LAA-950 and Perc15 in the lower panel). Color indicates phenotypes and shape indicates tissue (see figure legend)

In the whole blood reference dataset from DGN, we identified five significant genes: FAM13A in 4q22 (P = 4.81 × 10⁻⁸), HYKK and PSMA4 in 15q25 (P = 8.16 × 10⁻¹⁷ and 2.47 × 10⁻¹⁴, respectively), and EGLN2 and RAB4B in the 19q13 locus (P = 1.03 × 10⁻⁶ and 1.72 × 10⁻⁶, respectively). All of these genes are located in COPD susceptibility loci previously reported in the literature [4, 16]. In lung tissue, we identified two genome-wide significant genes, GPRIN3 in the 4q22 locus (P = 7.43 × 10⁻⁷) and WNT3 in the 17q21 locus (P = 1.24 × 10⁻⁶); the latter locus was not identified in the single variant GWAS of severe COPD (Fig. 3).

Fig. 3 Regional association plots within 50 kb of WNT3. GWAS of severe COPD and lung eQTL are shown in the upper panel. Chromatin states and epigenomic marks of normal human lung fibroblasts are shown in the lower panel (see Additional file 1: Supplementary Methods)

Emphysema

In whole blood and lung tissue, we identified five genes significantly associated with %LAA-950 and one gene with Perc15 (Table 3; Fig. 2). We found two significant associations of genes at loci previously associated with %LAA-950: PSMA4 in 15q25 and ATF6B in 6p21, the latter of which is located near AGER. The top genome-wide significant variant at this latter locus, which lies within the HLA (Human Leukocyte Antigen) region, is a nonsynonymous variant in AGER; however, AGER was not significant in either blood or lung (P = 0.81 and 0.18, respectively). LILRA3, DCBLD1, and ITGA1 are at loci not previously associated with COPD or emphysema.

Validation in other reference transcriptome databases

To provide further evidence for differentially expressed genes associated with severe COPD and emphysema, we repeated our analysis using additional reference transcriptome databases with the same GWAS data. In blood, using GTEx-Blood as the validation dataset for genes identified in the whole blood transcriptome analysis, we validated PSMA4, EGLN2, and RAB4B for severe COPD (P = 3.79 × 10⁻¹⁴, 1.34 × 10⁻⁵, and 1.33 × 10⁻⁴, respectively), and PSMA4 and LILRA3 for %LAA-950 (P = 3.37 × 10⁻⁷ and 3.62 × 10⁻⁵, respectively) (Table 3). For genes identified from GTEx-Lung, we validated WNT3 for severe COPD (P = 4.27 × 10⁻⁶) and DCBLD1 for %LAA-950 (P = 1.41 × 10⁻⁴) using a lung transcriptome database from the Lung-eQTL Consortium (Table 3). For several genes, a prediction model was not available, likely due to lower power and sample size in the validation dataset for whole blood [9]. Although the association of FAM13A was initially identified using the blood dataset, it was also significant in the Lung-eQTL Consortium dataset (Z score = 4.52, P = 6.3 × 10⁻⁶).

Colocalization analysis of validated genes

Gene expression differences identified using S-PrediXcan may be causally associated with the phenotype of interest, but can also be due to linkage disequilibrium (LD) [15]. To determine whether there was evidence of shared causality, we performed colocalization analysis using a method that allows for multiple causal variants. Of the seven validated associations, six had at least one colocalized variant (Table 4): PSMA4, EGLN2, and WNT3 (Fig. 3) for severe COPD; PSMA4, LILRA3 (Additional file 1: Figure S1), and DCBLD1 (Additional file 1: Figure S2) for %LAA-950. For associations identified in lung, we additionally confirmed the colocalization signals using the Lung-eQTL Consortium dataset (Additional file 1: Table S1). We then sought to leverage functional annotation of the shared variants, especially those with high colocalization probability. Some colocalized variants associated with PSMA4, LILRA3, DCBLD1, and WNT3 are located in annotated regulatory regions (e.g., rs35061187 is in an active transcription start site (TSS) in lung fibroblasts) or are predicted to affect transcription factor binding (Additional file 1: Tables S1 and S2).

Table 4 Colocalized variants in validated genes and association statistics in corresponding GWAS and eQTL datasets

Genetically regulated differential expression of genes in known susceptibility loci

Of the above significantly differentially regulated genes, four are in known susceptibility loci (4q22 and 15q25 with severe COPD, and 6p21 and 15q25 with %LAA-950). We also sought to investigate whether additional known susceptibility loci for severe COPD and quantitative emphysema affect the genetically regulated expression of nearby genes. We investigated nominal association results (P < 0.05) at nine other susceptibility loci in either the discovery or validation datasets. Using this criterion, we found five additional suggestive associations, namely TGFB2 (1q41), HHIP (4q31), and RIN3 (14q32.12) with severe COPD, and HHIP (4q31) with %LAA-950 and Perc15 (Additional file 1: Table S3). However, we did not find any suggestive signals in 11q22 (MMP12) with severe COPD, 14q32.13 (SERPINA10) with %LAA-950, or 8p22 (DLC1) with %LAA-950 and Perc15.

Pathway enrichment analysis

In contrast to genetic gene set enrichment methods that rely only on the location of the SNP to infer affected genes [17], we used the results of our predicted gene expression to identify pathways by using the top 1% of differentially expressed genes (Table 5, Additional file 1: Supplementary Methods). We identified enrichment of the T cell receptor signaling pathway (corrected P = 6.6 × 10⁻³); this pathway included PSMA4 along with genes in the HLA complex. We also found significant enrichment for proteasome core complex genes (corrected P = 2.82 × 10⁻²), which included PSMF1, PSMB4, and PSMB9. An additional pathway of interest was cell-matrix adhesion of collagen binding (corrected P = 2.74 × 10⁻³) (Table 5). We also found enrichment of the asthma pathway using the KEGG database (corrected P = 4.80 × 10⁻³), containing MS4A2 and genes in HLA.
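
The logic of testing a selected gene list for pathway enrichment can be sketched with a one-sided hypergeometric test. This is an assumption made for illustration: the exact test and multiple-testing correction used in the study are described in its Supplementary Methods and may differ, and the function name and toy numbers below are hypothetical.

```python
from math import comb

def enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """One-sided hypergeometric P-value for pathway over-representation.

    Probability of seeing at least `n_overlap` pathway genes when `n_selected`
    genes are drawn without replacement from `n_background` genes, of which
    `n_pathway` belong to the pathway. The result is uncorrected.
    """
    upper = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

# Toy example: 10 background genes, 5 in the pathway, 2 selected, both in the pathway
p_value = enrichment_p(10, 5, 2, 2)  # 10 / 45
```

In practice the selected list would be the top 1% of genes ranked by association, and each pathway P-value would then be corrected for the number of pathways tested.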

Table 5 Selected results of pathway enrichment analysis based on predicted differential gene expression

Discussion

Genome-wide association studies have arguably become the mainstay of identifying genetic risk factors for complex disease. However, these studies cannot identify which gene(s) in the region is responsible for the association, and testing all variants individually and independently is likely suboptimal. Here, we used an integrative method that combines the genetic component of gene expression with genetic association analysis in severe COPD and quantitative emphysema to predict differentially expressed genes. Importantly, this method focuses on the association of genetic component of gene expression, not gene expression as a whole, as is typical in most gene expression studies. We also provided additional support of our results by examining results in a second gene expression dataset, and performing colocalization analysis that attempts to identify whether association signals for gene expression and a phenotype of interest appear to be driven by the same causal variant(s). We implicated genes that are genetically regulated in known COPD-susceptibility loci, such as FAM13A, and also found genes in regions that were not previously reported: WNT3 for severe COPD, and DCBLD1 and LILRA3 for quantitative emphysema.

We found a novel association of WNT3 in lung tissue with severe COPD in two gene expression datasets. Although variants surrounding this gene in the 17q21 locus were not genome-wide significant in our GWAS of severe COPD (Fig. 3), the top signal (rs9912530) is in strong LD with variants previously reported in GWAS of FEV1 [18, 19], interstitial lung disease [20], and idiopathic pulmonary fibrosis [21] (r² with these previously described variants, 0.55–0.72). WNT3 (Wnt family member 3) encodes Wnt3, a critical component of the Wnt-beta-catenin-TCF signaling pathway [22] and a required signal for the apical ectodermal ridge in limb patterning [23]. Deficient WNT3 is associated with tetra-amelia syndrome, a Mendelian disease characterized by the absence of all limbs. The top signal is also in strong LD with variants associated with various complex diseases such as Parkinson's disease and celiac disease (r² 0.72–0.79) [24, 25]. Previous expression studies of small airway epithelium found that this gene, along with other Wnt signaling genes, was down-regulated in smokers compared with nonsmokers [26]. Of interest, FAM13A, a well-supported COPD susceptibility gene, has been implicated in the beta-catenin/Wnt signaling pathway through protein degradation [27]. While there is substantial interest in Wnt signaling in lung disease [28], the contribution of WNT3 to the pathogenesis of COPD requires further investigation. To address whether these findings were specific to severe COPD, we repeated the analysis including moderate disease (GOLD 2). All of our genes were at least nominally significant, though overall the significance of our findings was attenuated (Additional file 1: Table S4).

For emphysema, we identified novel associations of LILRA3 and DCBLD1 using whole blood and lung tissue, respectively, and validated these findings in additional gene expression datasets. LILRA3 (leukocyte immunoglobulin like receptor A3), located in the 19q13 locus, encodes a soluble receptor for class I major histocompatibility complex (MHC) antigens expressed in monocytes and B cells. Our top GWAS variant in this locus was not genome-wide significant (rs384116, P = 1.88 × 10⁻⁵; Additional file 1: Figure S1) and is 13 Mb away from the previously reported locus [16] that contains EGLN2 and RAB4B (rs7937; r² 0.002). It is in modest LD with variants suggestively associated with FEV1/FVC [18] (r² 0.44) and in strong LD with variants genome-wide significantly associated with HDL-C level [29] and prostate cancer [30] (r² 0.92–0.99). Blood may be the most relevant tissue for this gene, as it is preferentially expressed there [31] with a high estimate of heritability of gene expression in whole blood [32]. However, it may also have an effect in other tissues, given its broad eQTL effects identified by multi-tissue eQTL analysis [33]. This was supported by the suggestive signals for this gene using lung tissue in the S-PrediXcan analysis (P = 7.71 × 10⁻⁵ in GTEx-Lung and 1.38 × 10⁻⁴ in the Lung-eQTL Consortium, with the same direction of effect). Nonetheless, its functional role in COPD has not been described previously. Our other novel association identified in lung tissue, DCBLD1 (discoidin, CUB and LCCL domain containing 1), located in the 6q22 locus, is an integral component of cell membranes and binds to oligosaccharides [34]. GWAS signals in this locus also did not reach genome-wide significance (Additional file 1: Figure S2). Our top GWAS variant at this locus was in LD with variants associated with lung cancer [35] (r² 0.54).

In addition to novel associations, our study also provides insight into disease-associated genes in known COPD susceptibility loci. We identified six genes (FAM13A, GPRIN3, HYKK, PSMA4, EGLN2, and RAB4B) in three known COPD susceptibility loci for which the genetic component of gene expression in blood or in lung tissue is associated with severe COPD. Five of these six genes are not the most proximal to the top associated SNP, a phenomenon previously observed in other genetic association studies [36, 37]. These findings underscore the complexity of genetic regulation in tissues and also identify multiple potential effector genes in the same locus. For example, in 15q25, PSMA4, and not CHRNA3 (the nearest gene to the top GWAS hit), was highlighted in the S-PrediXcan and colocalization analyses. Although a role for IREB2 has been clearly demonstrated [38], our study suggests that other genes in the locus may also be of biologic importance, particularly PSMA4, a gene encoding a subunit of the proteasome complex that acts in the proteolytic pathway [39].

At the 4q22 locus, an association for FAM13A identified using DGN-Blood was not validated in the GTEx-Blood dataset. However, a significant but directionally opposite association was identified in the Lung-eQTL Consortium dataset. To further explore this phenomenon, we examined individual SNP eQTL data from Framingham Heart Study (FHS) blood and from lung tissue in the Lung-eQTL Consortium (Additional file 1: Supplementary Methods). We confirmed that SNPs have opposite directions of effect in lung and blood (Additional file 1: Figures S3 and S4). This finding is consistent with prior reports describing significant and opposite tissue-specific effects of eQTLs [33, 40, 41]. The interpretation of this phenomenon is not clear, but it may be a result of pleiotropic effects of FAM13A [42, 43]. Of note, a recent analysis of emphysema-related gene expression in blood and lung tissue [44] found that the direction of gene expression changes in the two tissues is often opposite; together, our findings highlight the tissue-specific genetic regulation of genes in COPD susceptibility loci. At the 19q13 locus, while both EGLN2 and RAB4B were successfully validated, only the GWAS and eQTL signals for EGLN2 colocalized. This genetic locus has been associated with COPD [16] and smoking behavior [45]. Although the causal gene(s) in this region is unclear, methylation and expression studies support the role of EGLN2 in this region [46]. EGLN2 (egl-9 family hypoxia inducible factor 2) encodes an enzyme that regulates the degradation of the alpha subunit of hypoxia inducible factor (HIF) [47]. Gene and protein expression of HIF-1α is reduced in lung tissue samples from COPD patients [48].

Although ATF6B (activating transcription factor 6 beta) and ITGA1 (integrin subunit alpha 1) were not successfully validated, we cannot rule out the possibility of false negatives due to differences between the transcriptome datasets used for validation, and they are potentially interesting candidates for COPD. ATF6B was implicated in the unfolded protein response (UPR) pathway during endoplasmic reticulum (ER) stress following cigarette smoke, and may contribute to lung inflammation in patients with COPD [49], while integrins were found to be involved in COPD through the mitogen-activated protein kinase (MAPK) pathway [50, 51]. This region also harbors variants associated with FEV1/FVC [52]. Decreased expression of ITGA1 was observed in the small airways of patients with low FEV1 [53].

Our analysis assesses only the genetic component of gene expression. We also investigated whether these genes were differentially expressed in COPD patients, using 464 blood samples from the COPDGene study [54] and 151 lung tissue samples [55] (Additional file 1: Supplementary Methods and Tables S5-S8). These genes were not differentially expressed, with the exception of LILRA3, which was nominally significant with %LAA-950 (P = 0.03). Given that the genetic component of gene expression was replicated, we believe that the genetic findings are robust, and we speculate that these null findings could be due to non-genetic (i.e., environmental) perturbations that may occur downstream of, or as a result of, the genetic effects. In fact, in several cases measurements of mRNA or protein are opposite to those predicted by genetic risk. For example, SERPINA1 risk alleles result in decreased alpha-1 antitrypsin levels and increased risk for COPD, yet average alpha-1 antitrypsin levels in patients with COPD are elevated. Similarly, genetic variants in AGER and DSP affect transcript or protein levels in the direction opposite to what is measured in disease [4, 56, 57]. The mechanisms underlying our genetic findings, as well as those for AGER and DSP, that result in null or opposite-direction effects require further experimental investigation.

In addition to examining individual loci, we applied pathway enrichment analysis to nominally significant differentially expressed genes in severe COPD and quantitative emphysema in both whole blood and lung tissue. This analysis identified enrichment of the T cell receptor (TCR) signaling pathway in emphysema. This finding is consistent with reports of antigen-specific T cell differentiation in the lungs of patients with severe emphysema [58]. Our analysis using gProfileR does not assess direction of effect, and the relative up- or down-regulation of specific genes in this pathway makes determination of direction difficult. To attempt to infer direction, we used Gene Set Enrichment Analysis (GSEA; [59]). In these results, the TCR signaling pathway and downstream TCR response were up-regulated, though these results were not statistically significant (Additional file 1: Table S9). Further study will be needed to determine the combined effects of COPD genetic susceptibility variants on T cell function and whether these explain some of the immune dysfunction seen in COPD [60, 61]. The enrichment of genes in the proteasome core complex further suggests a role for the proteasome in COPD, as described previously. Somewhat surprisingly, we observed enrichment of the asthma pathway in KEGG using genes identified in quantitative emphysema. This finding complements the description of substantial genetic correlation between COPD and asthma [4], and the presence of quantitative emphysema (or lung hyperinflation) in asthmatic patients [62].

Our study did not identify associations of genetically regulated differential expression of genes at some previously reported GWAS loci. Moreover, some of our identified associations in our discovery dataset were not successfully validated in a second transcriptome dataset. These findings indicate some of the limitations of our approach. First, as S-PrediXcan uses cis genetic variants as predictors for gene expression, variants that have lesser or no effect on transcript abundance or act in trans would not be detected by this approach [63]. Second, although most genetic variants implicated by GWAS are likely regulatory, only a minority of genetic loci are explained by existing eQTLs [64]. This may be due to lack of data in the appropriate tissue, cell type, or biologic conditions; or the heterogeneity of gene expression studies of bulk tissue. We may overcome these issues as more gene expression datasets and newer techniques such as single-cell gene expression profiling [65] become widely available. Moreover, issues such as cell type composition, sample collection methods, disease status, and differences in analytic methods also made the overlapping analysis challenging. Third, the number of genes available for an analysis depends on the power and sample size of the expression data used in constructing a gene expression prediction model [8, 9]. Given the noisy and condition-specific nature of gene expression datasets, variants with small effects on gene expression may be undetectable at the sample sizes available. Additionally, the difference in sample size among transcriptome databases decreases our power to validate or discover more genes.

However, despite technical and population differences, most cis-eQTLs appear to be consistent between studies [66]. Therefore, although the overall correlation between predicted and measured gene expression is modest in some cases, associations of the genetic component of gene expression, as inferred by imputed gene expression, have been successful in identifying disease-associated genes and complement existing methods.

Conclusions

In conclusion, we found that genetic determinants of gene expression were associated with severe COPD and quantitative emphysema phenotypes, identifying genes at known loci, and identifying novel COPD-associated genes. These findings were obtained by integrating GWAS results with gene expression data, performing colocalization analysis, and validating key results in independent gene expression datasets. These findings may provide mechanistic insights into the genetics of COPD.