Introduction

Osteoporosis is a highly polygenic disease characterized by low bone mass and deterioration of bone microarchitecture, leading to increased skeletal fragility and fracture risk [1,2,3]. Bone mineral density (BMD) measures the mineral component of bone independently of bone quality or structure; low BMD is a strong, clinically relevant risk factor for osteoporosis and a key indicator for its diagnosis and treatment [4, 5]. Although BMD is most often measured by dual-energy X-ray absorptiometry (DXA) in clinical settings, it can alternatively be estimated from quantitative ultrasound, typically at the heel (referred to here as estimated BMD, eBMD). A previous genome-wide association study (GWAS) of eBMD based on heel ultrasound parameters identified 84% of all currently known genome-wide significant loci for DXA-derived BMD [6, 7], and the effect sizes were concordant between the two traits (Pearson’s r = 0.69 for the lumbar spine and 0.64 for the femoral neck) [7]. Ultrasound-derived eBMD is highly heritable (on the order of 50% to 80%) [8], and a recent GWAS of eBMD in 426,824 individuals identified 1103 independent genome-wide significant associations at 518 loci, approximately 6.4-fold more than the number of discoveries from DXA GWASs [9, 10]. However, the majority of GWAS hits lie in noncoding regions, and their biological mechanisms are difficult to interpret [11, 12].

The effect of genetic variation on phenotype is complex and might involve altering the abundance of one or more proteins by regulating gene expression, which in turn affects the trait (SNP-Expression-Phenotype) [13, 14]. Gene expression is arguably the most impactful and well-studied effect of regulatory genetic variation. GWAS-derived loci are enriched for expression quantitative trait loci (eQTLs), which makes eQTLs a potential link between a genetic variant and the biology of the disease [15,16,17]. Although most GWASs do not concomitantly measure gene expression, the influence of genetic variation on gene expression allows reference datasets (e.g., GTEx [18]) to be used to predict gene expression from a set of genotypes and subsequently to identify new disease-associated genes [11, 13, 19]. Thus, the transcriptome-wide association study (TWAS) approach was established to identify genes whose expression is associated with complex traits by integrating genetic and transcriptional variation [20, 21]. Instead of testing millions of SNPs as in a GWAS, a TWAS tests the association between the trait and the predicted expression of thousands of genes, which greatly reduces the multiple-comparison burden. This approach has been shown to have the potential to identify the genes responsible for GWAS-identified associations for complex traits and to provide mechanistic insight into genes that are regulated by disease-associated genetic variants [22,23,24].

In this study, we conducted a TWAS to identify genes associated with osteoporosis by integrating gene expression data from the Genotype-Tissue Expression (GTEx) project and GWAS summary data from the Genetic Factors for Osteoporosis (GEFOS) Consortium, and then evaluated the biological patterns of expression-trait associations using the COLOC method. We then used VarElect to assess the biological function underlying the associations between the significant genes identified by the TWAS and osteoporosis. Finally, by comparison with the results of a differential analysis of multiple osteoporosis gene expression profiles, we further verified the causal associations between eBMD and the significant genes identified by the TWAS.

Methods

GWAS summary datasets of osteoporosis

The GWAS summary statistics for osteoporosis were derived from the GEFOS Consortium website in December 2018. The phenotypic features of osteoporosis were measured based on the bone mineral density estimated from quantitative heel ultrasounds. The large-scale GWAS analysis of eBMD values was performed with a cohort of 426,824 participants (55% female) from the UK Biobank [9]. Briefly, a GWAS was performed based on the HRC imputation panel (hg19), which includes approximately 14,000,000 SNPs with MAF ≥ 0.05% and acceptable imputation quality (info score > 0.3). A detailed description of the sample characteristics, experimental design, and statistical analysis was previously published [25].

Integration of GWAS and gene expression

To integrate the GWAS results and gene expression, we used the TWAS method. TWAS integrates information from expression reference panels (SNP−gene expression correlation), GWAS summary statistics (SNP−phenotype correlation), and linkage disequilibrium reference panels (SNP−SNP correlation) to assess the association between the cis-genetic component of gene expression and a trait (expression−osteoporosis correlation) [20, 22]. In practice, the effect sizes of cis-SNPs on expression within the 500-kb cis region of each gene were estimated using a sparse mixed linear model [26]. The TWAS then combined these pre-computed gene expression weights with GWAS summary statistics to calculate the association of each gene with the disease.
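The way summary statistics, expression weights, and LD combine into a gene-level statistic can be illustrated with a minimal sketch. It assumes the summary-based TWAS statistic z = w'Z / sqrt(w'Rw), where w are the expression weights, Z the GWAS z-scores, and R the SNP−SNP LD correlation matrix. The function names and toy values below are hypothetical and are not taken from this study's pipeline.

```python
import math


def twas_z(weights, gwas_z, ld):
    """Gene-level z-score from summary data: w'Z / sqrt(w' R w).

    weights : per-SNP expression weights for one gene (hypothetical values)
    gwas_z  : per-SNP GWAS z-scores, in the same SNP order
    ld      : SNP-by-SNP LD correlation matrix as a list of rows
    """
    n = len(weights)
    if len(gwas_z) != n or len(ld) != n:
        raise ValueError("weights, gwas_z and ld must cover the same SNPs")
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    variance = sum(
        weights[i] * ld[i][j] * weights[j] for i in range(n) for j in range(n)
    )
    if variance <= 0:
        raise ValueError("predicted expression has zero variance")
    return numerator / math.sqrt(variance)


def two_sided_p(z):
    """Two-sided p value of a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))


# Toy example with three cis-SNPs (all numbers are illustrative only).
toy_weights = [0.5, 0.0, -0.3]
toy_gwas_z = [4.0, 1.0, -3.0]
toy_ld = [
    [1.0, 0.2, 0.0],
    [0.2, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]
toy_z = twas_z(toy_weights, toy_gwas_z, toy_ld)
toy_p = two_sided_p(toy_z)
```

SNPs with a weight of zero drop out of both the numerator and the variance term, which is why sparse weight models leave only a few SNPs contributing to each gene.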

To select trait-related tissues, we used TSED_DB [27], a reference database for trait-associated tissue specificity based on GWAS results. The UK Biobank heel eBMD-related genes were significantly enriched in muscle-skeletal tissue. Muscle and bone form a functionally interacting unit, and muscle mass is positively correlated with BMD [28]. Several studies have shown that low muscle mass is significantly associated with osteoporosis in both men and women of all age groups [29], and age-related low muscle mass might increase the risk of osteoporotic hip fractures [30]. In this study, we used gene expression weights pretrained on the GTEx v7 muscle-skeletal tissue dataset, which we obtained from the FUSION website. Genes with significant association signals were identified based on a p value < 3.7E-06 after strict Bonferroni correction.
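The significance threshold is consistent with a Bonferroni correction over the 13,416 expressed genes in the muscle-skeletal panel reported in the Results. The short check below assumes a family-wise error rate of 0.05, which the text does not state explicitly.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


ALPHA = 0.05      # assumed family-wise error rate (not stated in the text)
N_GENES = 13416   # expressed genes in the muscle-skeletal reference panel

threshold = bonferroni_threshold(ALPHA, N_GENES)
print(f"{threshold:.2e}")
```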

Evaluation of trait−gene expression associations

To evaluate the reliability of the TWAS results and understand the biological mechanisms of trait−gene expression associations, we used the COLOC method [31], which utilizes asymptotic Bayes factors with summary statistics and a regional linkage disequilibrium structure to estimate five posterior probabilities: no association with either a GWAS signal or an eQTL (PP0), association with a GWAS signal only (PP1), association with an eQTL signal only (PP2), association with a GWAS signal and an eQTL through two independent SNPs (PP3), and association with a GWAS signal and an eQTL through one shared SNP (PP4). For each of the GWAS hits, we defined a 500-kb region on either side of the index variant and tested for colocalization within the entire cis−region of any overlapping eQTLs (transcription start and end positions of an eQTL gene plus and minus 500 kb, as defined by GTEx) in muscle–skeletal tissue from GTEx v7. In this study, we set the prior probability that a random variant in the region is associated with either the GWAS trait or the eQTL individually to 1E−04 and the prior probability that a random variant is causal for both the GWAS trait and the eQTL to 1E−06. Several studies have shown that PP3+PP4 > 0.8 is a cut-off threshold that provides evidence of colocalization [32, 33]. We used strict thresholds: PP3 > 0.9 for evidence of trait−gene expression associations caused by distinct causal variants for the GWAS and the eQTL, and PP4 > 0.8 for evidence of trait−gene expression associations caused by a joint signal shared by the GWAS and the eQTL [34].
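The five posterior probabilities can be illustrated with a minimal sketch of how per-SNP log approximate Bayes factors are combined under the single-causal-variant assumption. It uses the Wakefield approximate Bayes factor and the priors stated above. The prior effect standard deviation (0.15), the function names, and the toy inputs are assumptions for illustration only; this is not the COLOC implementation itself.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def logdiffexp(a, b):
    """log(exp(a) - exp(b)); returns -inf when the difference is not positive."""
    if b >= a:
        return float("-inf")
    return a + math.log1p(-math.exp(b - a))


def wakefield_labf(z, var_beta, prior_sd=0.15):
    """Log approximate Bayes factor for one SNP (prior_sd is an assumed value)."""
    r = prior_sd ** 2 / (prior_sd ** 2 + var_beta)
    return 0.5 * (math.log(1.0 - r) + r * z * z)


def coloc_posteriors(labf_gwas, labf_eqtl, p1=1e-4, p2=1e-4, p12=1e-6):
    """Return [PP0, PP1, PP2, PP3, PP4] from per-SNP log Bayes factors."""
    both = [a + b for a, b in zip(labf_gwas, labf_eqtl)]
    s1, s2, s12 = logsumexp(labf_gwas), logsumexp(labf_eqtl), logsumexp(both)
    log_h = [
        0.0,                                                      # no association
        math.log(p1) + s1,                                        # GWAS only
        math.log(p2) + s2,                                        # eQTL only
        math.log(p1) + math.log(p2) + logdiffexp(s1 + s2, s12),   # two SNPs
        math.log(p12) + s12,                                      # one shared SNP
    ]
    total = logsumexp(log_h)
    return [math.exp(h - total) for h in log_h]


def classify(pp, pp3_cut=0.9, pp4_cut=0.8):
    """Apply the thresholds used in this study."""
    if pp[4] > pp4_cut:
        return "shared causal variant"
    if pp[3] > pp3_cut:
        return "distinct causal variants"
    return "inconclusive"


# Toy region of three SNPs: both signals peak at the same SNP ...
pp_shared = coloc_posteriors([20.0, 0.0, 0.0], [20.0, 0.0, 0.0])
# ... versus signals that peak at different SNPs.
pp_distinct = coloc_posteriors([20.0, 0.0, 0.0], [0.0, 0.0, 20.0])
```

In the toy inputs, aligning the two peaks on one SNP moves nearly all posterior mass to PP4, whereas separating them moves it to PP3.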

Assessment of gene−disease associations

To assess whether the candidate genes are functionally linked to osteoporosis, and hence more likely to be causal, the associations of biological function between the candidate genes and osteoporosis were assessed using VarElect [35, 36], a Variant Election application for disease/phenotype-dependent gene variant prioritization. VarElect ranks the genes within a shortlist by their likelihood of being associated with the disease of interest and produces a list of prioritized, scored, and contextually annotated genes with direct links to supporting evidence and additional information. VarElect uses the LifeMap Knowledgebase to infer “direct” or “indirect” associations of biological function between genes and phenotypes. A “direct” association between a gene and the disease is supported by studies showing that the gene can directly affect disease development, whereas an “indirect” association is based on shared pathways, protein-protein interaction networks, paralogy relationships, domain-sharing, and mutual publications.

Protein−protein interaction (PPI) network and pathway enrichment analysis

The functional networks of genes that were found to be significantly associated with osteoporosis by the TWAS were further validated using the STRING and CluePedia tools. STRING (Search Tool for the Retrieval of Interacting Genes) is an online tool designed to evaluate PPI networks [37, 38], and CluePedia is a plugin of Cytoscape software that searches for potential genes associated with certain signaling pathways by calculating linear and nonlinear statistical dependencies from experimental data [39, 40]. The PPI networks of the significant genes identified by the TWAS were constructed using STRING. The functional pathways were detected and visualized using CluePedia.

Differential analysis of gene expression

To further validate the functional causality of candidate genes, the Gene Expression Omnibus (GEO) database and the European Molecular Biology Laboratory (EMBL-EBI) database were searched to identify gene expression profiling studies of subjects with osteoporosis. The following key search terms were used: “osteoporosis,” “gene expression,” and “microarray.” We obtained gene expression profiles from four different sources and included original microarray studies that analyzed differential gene expression between patients with osteoporosis and normal controls, as shown in Table 2. Heterogeneity among microarray studies, arising from different microarray platforms, gene nomenclature, and clinical samples, makes it infeasible to compare the gene expression data directly, so normalization is necessary to minimize it. We therefore applied the robust multiarray average approach [41] for background correction and normalization. The original GEO data were then converted into expression measures. The Limma package [42] was used to identify the differentially expressed probe sets between patients with osteoporosis and normal controls. Gene-specific t tests were performed, and p values were calculated. Multiple testing adjustment was performed, and the genes with adjusted p values < 0.05 were selected as differentially expressed genes (DEGs).
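The final selection step can be sketched as follows. The text does not state which multiple testing adjustment was applied, so this example assumes the Benjamini-Hochberg procedure. The gene labels and p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(gene_p_values, cutoff=0.05):
    """Return genes whose adjusted p value is below the cutoff."""
    genes = list(gene_p_values)
    adjusted = benjamini_hochberg([gene_p_values[g] for g in genes])
    return [g for g, q in zip(genes, adjusted) if q < cutoff]


# Hypothetical gene-specific t-test p values (illustrative only).
toy_p_values = {"GENE_A": 0.001, "GENE_B": 0.02, "GENE_C": 0.04, "GENE_D": 0.5}
toy_degs = select_degs(toy_p_values)
```

In this toy input, GENE_C has a raw p value below 0.05 but is not selected, because its adjusted value (0.04 × 4 / 3) exceeds the cutoff.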

Results

TWAS-based identification of candidate genes for the treatment of osteoporosis

We first used the TWAS method with GWAS summary data from the GEFOS consortium to identify candidate genes associated with osteoporosis. In this study, we used the eBMD GWAS summary dataset rather than the fragility fracture dataset because the TWAS found no enrichment for fragility fractures, as shown in Supplementary Table 7. A gene expression reference panel for muscle-skeletal tissue, which has a total of 13,416 expressed genes, was used. The TWAS identified 204 significantly associated genes with a p value < 3.7E-06, as shown in Fig. 1.

Fig. 1

Manhattan plot of the results from the TWAS (upper panel) and GWAS (lower panel) of osteoporosis. The transcriptome-wide significance threshold was p value = 3.7E-06; the genome-wide significance threshold was p value = 6.6E-09. A total of 1103 conditionally independent SNPs at 518 loci among n = 426,824 UK Biobank participants passed the criteria for genome-wide significance.

The TWAS method can detect candidate causal genes by predicting gene expression from genetic variants. The following four biological patterns were identified by the TWAS (Fig. 2). First, for SNPs in coding regions (introns and exons) significantly associated with osteoporosis, the causal genes identified by the GWAS and the TWAS were likely to be consistent, as shown in Fig. 2a. The effect of rs10411210 (PGWAS = 1.6E-119) on osteoporosis obtained from the GWAS corresponds with that of rs10411210 on RHPN2 (PTWAS = 4.4E-73) gene expression identified from the TWAS. Second, for SNPs in noncoding regions, the candidate genes might be close to the significant eQTLs but different from the GWAS hits, as shown in Fig. 2b. The variant rs2785197 (PGWAS = 6.5E-44) in 11p13 was mapped to PDHX in the GWAS, but the causal gene for rs2785197 in our TWAS results was more likely to be CD44 (PTWAS = 1.1E-32). The colocalization analysis showed that CD44 gene expression (PP4 = 0.99 in Supplementary Table 2) was regulated by the single variant rs2785197, which might be regarded as an expression regulatory element of CD44. Third, the candidate genes might be regulated by relatively distant significant SNPs in noncoding regions, as shown in Fig. 2c. Our TWAS results indicated that rs4792909 (PGWAS = 1.5E-74) in 17q21.31 might be associated with G6PC3 (PTWAS = 4.2E-26). The distance between rs4792909 and G6PC3 is 387 kb, and the GWAS did not identify any gene near rs4792909. Fourth, candidate genes were discovered based on SNPs that were not significantly associated with osteoporosis in the GWAS. One such region was a novel discovery: rs1003260 (PGWAS = 3.6E-08) in 6q13 was associated with RIMS1 (PTWAS = 2.1E-08), as shown in Fig. 2d. To our knowledge, this is the first report of RIMS1 as a locus associated with BMD, and it warrants further investigation.

Fig. 2

Biological patterns identified by the TWAS. a): For significant SNPs in coding regions, rs10411210 (PGWAS = 1.6E-119) in 19q13.11 is associated with RHPN2 (PTWAS = 4.4E-73). b): For SNPs in the noncoding regions, rs2785197 (PGWAS = 6.5E-44) in 11p13 was associated with PDHX, which was marked in green, as determined by the GWAS, but the causal gene for rs2785197 was more likely to be CD44, which is marked in red (PTWAS = 1.1E-32), as determined by our TWAS. c): rs4792909 (PGWAS = 1.5E-74) in 17q21.31 might be associated with G6PC3 (PTWAS = 4.2E-26). The distance between rs4792909 and G6PC3 is 387 kb, but no gene has been identified by a GWAS near rs4792909. d): rs1003260 (PGWAS = 3.6E-08) in 6q13 was associated with RIMS1 (PTWAS = 2.1E-08)

Gene expression differences identified by TWAS might be causally associated with the phenotype of interest but can also be due to variant linkage disequilibrium or gene product co-expression [43, 44]. To pinpoint the causal relationship between the target gene of an eQTL and a complex trait, we performed a colocalization analysis using the COLOC method (see the Methods section). We used a strict threshold of PP4 > 0.8 for single-variant colocalization and a stricter threshold of PP3 > 0.9 for colocalization with multiple variants because the latter category appears slightly more inflated than PP4 (see the QQ plot in Supplementary Figure 4). The results showed that 103 TWAS associations provided strong evidence of distinct causal variants with PP3 > 0.9, as shown in Supplementary Table 1, and 101 showed evidence of a single shared causal variant with PP4 > 0.8, as shown in Supplementary Table 2.

In a comparison with previous GWASs, we found that 51 of the identified genes had previously been implicated in osteoporosis risk by GWASs, as documented in the literature, whereas 153 genes have not been reported to be associated with osteoporosis risk in previous GWASs, as shown in Figs. 3a–b.

Fig. 3

Significant genes and candidate genes in muscle-skeletal tissue identified by TWAS. (a) Comparison of significant genes found using the TWAS and GWAS methods. (b) The top 20 candidate genes that were not reported in previous GWASs. The red bars indicate upregulated gene expression, and the blue bars indicate downregulated gene expression (full lists can be found in Supplementary Figure 1, Supplementary Table 1 and Supplementary Table 2).

Assessment of the candidate gene−osteoporosis associations

For 153 candidate genes, we evaluated the associations between the candidate genes and osteoporosis through an analysis using VarElect. The analytical results showed that 20 genes (Supplementary Table 3) were “directly” associated, 83 genes were “indirectly” associated (Supplementary Table 4), and the remaining genes have not yet been classified. The direct associations indicated that the target genes were supported by rich evidence (the relevant literature, gene function annotation, etc.). The score shown in Supplementary Table 3 indicated the strength of the association between the gene and osteoporosis: a higher score indicates stronger evidence. Indirectly associated genes might interact with intermediaries to influence the development of osteoporosis through a PPI network and pathways (Supplementary Table 5). We considered the remaining unidentified genes as novel candidate genes, which were mainly lncRNAs, pseudogenes, and antisense genes. These novel candidate markers are potential disease factors for which there is no available evidence and thus need further investigation.

Functional pathways of the candidate genes

To further verify the associations between the significant genes identified by the TWAS and osteoporosis, we explored the biological function pathways of these genes using the STRING and CluePedia tools. Four pathways that may promote the understanding of the mechanism of osteoporosis were enriched (adjusted p value < 0.05), as shown in Table 1. However, most eBMD-related genes have not yet been well studied, and few overlaps were found with the KEGG database. Other nonsignificant KEGG pathways may also have important roles in osteoporosis and provide additional clues, as shown in Supplementary Table 5. Some of these pathways interact with each other, as shown in Supplementary Figure 5 (e.g., PI3K-Akt signaling, focal adhesion, and ECM-receptor interaction). These results showed that the significant genes identified by the TWAS are involved in many biological mechanisms in the development of osteoporosis.

Table 1 Functional pathways of significant genes identified by the TWAS

Functional validation for the candidate genes

Previous research based on expression profiling with gene signatures of cellular models to characterize the involvement of genes in bone metabolism and disease processes revealed that impaired osteoblastic differentiation reduces bone formation and causes severe osteoporosis in animals [45]. We analyzed four gene expression profile datasets from bone, bone marrow, monocytes, and B cells, comparing patients with osteoporosis with normal controls or high-BMD with low-BMD groups. Based on the cut-off criterion for the identification of DEGs (adjusted p value < 0.05), a total of 15 significant genes identified by the TWAS were replicated, and 11 of these genes were not found by the GWAS, as shown in Table 2. The results of the functional pathway analysis also supported our findings. As shown in Fig. 4, four differentially expressed genes were enriched in four KEGG pathways relevant to osteoporosis. SLC11A2, which is enriched in the mineral absorption pathway (Ppathway = 0.019), regulates the fine-tuned balance between bone resorption and bone formation and thus affects bone density [46]; MAP2K5 is enriched in the MAPK signalling pathway (Ppathway = 0.388), which is involved in the regulation of many cellular physiological functions, such as proliferation, differentiation, inflammation, and apoptosis, and affects bone formation [47, 48]; NFATC4 is enriched in the Wnt signalling pathway (Ppathway = 0.104) and is a candidate for therapeutic intervention aimed at increasing bone mass and strength in treated patients [49, 50]; HSP90B1 is enriched in the PI3K-AKT signalling pathway (Ppathway = 0.020), which is involved in the inhibition of osteoporosis through the promotion of osteoblast proliferation, differentiation and bone formation [51, 52]. Therefore, we inferred that these genes are very likely to be causal pathogenic genes of osteoporosis.
Due to the small sample size of the mRNA expression datasets, more experiments and other types of RNA datasets are needed in the future.

Table 2 Significant genes identified by the TWAS that show significantly differential gene expression in the four gene expression profile datasets. The red marker genes were not identified by the GWASs
Fig. 4

Biological function verification of the significant genes identified by the TWAS. SLC11A2, NFATC4 and HSP90B1 showed significantly differential expression in bone tissue and bone marrow cells between patients with osteoporosis and normal controls. MAP2K5 gene expression is significantly downregulated in B cells of low-BMD samples

OP osteoporosis, NOR normal, BMD bone mineral density

Discussion

Multiple GWASs with considerable sample sizes have been performed to characterize the heritable basis of osteoporosis, but progress toward understanding the mechanism of the disease has been limited. Most GWAS hits are in noncoding regions, which makes downstream biological inference difficult. In most cases, the nearest genes are reported [53, 54]. However, SNPs in noncoding regions do not necessarily regulate the nearest gene. The integration of GWAS and transcriptome data can enable novel discoveries and, most importantly, help pinpoint causality. The TWAS method calculates local SNP−gene expression correlations and further calculates the likelihood of gene causality. Therefore, for a significant SNP in a coding region, the causal genes identified by the GWAS and TWAS methods should be, and indeed are, consistent, as shown in Fig. 2a. For SNPs in noncoding regions, the causal genes might be close to the significant eQTLs and might differ from the GWAS hits, as shown in Fig. 2b. The most relevant GWAS variants and their nearest genes were not always enriched as causal variants/genes by the TWAS. Accordingly, we compared the GWAS-reported genes and the TWAS-enriched genes in all significant GWAS regions, as shown in Supplementary Table 8. The TWAS method can even discover causal genes in regions with no significant GWAS hits, as shown in Fig. 2d, and genes regulated by relatively distant significant SNPs, as shown in Fig. 2c. Additional region plots can be found in Supplementary Figure 2.

We found 204 significant candidate genes through our TWAS. Among these genes, 103 genes were regulated by two distinct causal variants, and 101 genes were regulated by a single causal variant. In comparison with the GWAS, 51 genes overlapped. For the remaining 153 genes, an analysis of their biological functions revealed that 20 genes directly affected pathways closely related to the development of osteoporosis: IBSP, EIF2B2, CD44, FEN1, UBA7, MARCO, ATF1, CBFB, G6PC3, SLC11A2, MST1R, PLEKHM1, ATRIP, CCDC36, AKAP7, EPRS, CTSB, CRHR1, FADS1, and MAP1LC3A. For example, IBSP (score = 13.74) is the gene most strongly associated with osteoporosis. The COLOC analysis showed that the SNPs rs1471403 and rs1054627 might co-regulate IBSP gene expression. In addition, previous studies have shown that IBSP is expressed in all major bone cells, including osteoblasts, osteocytes, and osteoclasts [55], and encodes a major noncollagenous bone matrix protein that binds to calcium and hydroxyapatite via acidic amino acid clusters in the PI3K-AKT signaling pathway [51]. In contrast, we found that 83 genes appear to affect the development of osteoporosis through a PPI network. As shown in Supplementary Figure 3, RAC3 and NFATC4 were enriched in the MAPK signaling pathway (Ppathway = 0.388) through interactions with genes (ESR1, FOS, IGF1, TGFB1, JUN, NFATC1, LRP5, TNF, and PRKACA) that are known to be associated with osteoporosis. More information on gene interactions can be found in Supplementary Tables 4 and 5. In addition, 50 significant markers are novel candidate genes that are not associated with osteoporosis based on existing knowledge; these include 19 genes, 13 lincRNAs, 9 pseudogenes, and 9 antisense genes. Some of these candidate genes, such as AF131215.2 (PTWAS = 1.92E-66) and RP11-73M18.6 (PTWAS = 2.72E-51), were highly significant.
We also found that RIMS1 was located in a new locus and that its causal SNPs were not significantly associated with osteoporosis in the GWAS. RIMS1 is a RAS gene superfamily member and plays a role in the regulation of voltage-gated calcium channels during neurotransmitter and insulin release. Although previous studies have not provided evidence supporting a causal association with osteoporosis, these genes might be potential causal biomolecules for osteoporosis, and more experiments are needed to verify their biological function.

Furthermore, we obtained additional evidence by comparing differentially expressed genes through an analysis of four types of gene expression profiles. Our results identified 15 significantly differentially expressed genes, as shown in Table 2. Among them, 11 genes were not discovered by the GWAS, and these included SLC11A2, G6PC3, and MAP1LC3A, which were classified as directly associated with osteoporosis, and NFATC4, HSP90B1, TP53I13, MTMR9, PCGF2, MAPT-AS1, and MAP2K5, which are considered to be indirectly associated with osteoporosis based on PPI networks and the literature. Notably, SLC11A2, NFATC4, HSP90B1, and MAP2K5 were enriched in four pathways relevant to osteoporosis (Fig. 4). In addition, GPATCH1, SPTBN1, DPP8, and ISYNA1 were also found in the GWAS, and our results further support their potential as candidate disease markers. The biological function information of these 15 genes can be found in Supplementary Table 6.

This investigation constitutes the largest study integrating the GWAS and TWAS methods to identify osteoporosis susceptibility genes. We used data from the 426,824 individuals in the eBMD GWAS and 860 samples from GTEx in our analyses. Nevertheless, this research has some limitations. First, the current TWAS method cannot explain variants that influence disease independently of cis expression because it was trained only on cis-eQTL analysis. Second, some bias might exist due to the use of normal muscle-skeletal tissues from GTEx to make predictions. Third, tissue sensitivity and tissue specificity are important issues to consider when performing a TWAS. Prediction models built on gene expression data from osteoblast cells of osteoporosis patients will help identify additional candidate genes associated with osteoporosis [56].

In summary, we integrated GWAS and transcriptome expression data to identify 204 significant genes associated with osteoporosis. One hundred fifty-four genes have been previously associated with osteoporosis (literature, protein-protein interaction networks, pathways, etc.), and 50 genes have not been previously discovered. We also analyzed the biological patterns of these loci and explained their pathway interactions. We hope that our findings will provide novel insights for future pathogenetic studies of osteoporosis.