Introduction

Uterine fibroids or leiomyomata are common, benign tumors of the uterus with an estimated lifetime risk of 77% by menopause (Baird et al. 2003). African Americans (AA) are more likely to have fibroids than women of European ancestry (EA), with AA having greater than 80% incidence of fibroids by menopause compared to nearly 70% for EA (Baird et al. 2003). AA women also have larger and more numerous fibroids as well as a younger age-of-onset on average (Baird et al. 2003). In addition to race/ethnicity (Baird et al. 2003; Faerstein et al. 2001; Marshall et al. 1997), there are other well-characterized risk factors for fibroids, including early age at menarche (Baird and Dunson 2003; Cha et al. 2011; Faerstein et al. 2001; Luoto et al. 2000; Moore et al. 2008), being overweight (BMI 25–29 kg/m2) (Baird et al. 2007; Moore et al. 2008; Terry et al. 2007; Wise et al. 2005a), and older premenopausal age (Baird et al. 2003; Moore et al. 2008). In addition, higher parity is associated with reduced fibroid risk, likely due to pregnancy-related hormonal and uterine changes (Baird and Dunson 2003). Symptoms of uterine fibroids may include pelvic pain and abnormal or heavy menses, though many fibroids are asymptomatic (Baird et al. 2003; Borah et al. 2013; Vollenhoven 1998). The lack of overt symptoms makes imaging crucial for classification of case/control status, as up to 51% of women may be misclassified by self-report (Baird et al. 2003; Myers et al. 2012). We have developed and validated a phenotyping algorithm to classify fibroid case/control status using electronic health records (Feingold-Link et al. 2014). This algorithm requires pelvic imaging for identification of cases and controls, which reduces misclassification of both cases and controls.

Several lines of evidence suggest genetic risk factors influence fibroid development. Estimates of the heritability of fibroids from familial aggregation and twin studies range from 26 to 69% in European populations (Luoto et al. 2000). Racial disparities in age at onset, number, size, and lifetime incidence of fibroids by menopause (Baird et al. 2003) also strongly support a role for genetic factors. There has been one published genome-wide association study (GWAS) of uterine fibroids, which was performed among Japanese subjects (Cha et al. 2011). Genome-wide linkage and follow-up association studies in a meta-analysis of EA women implicated an additional locus for risk of fibroid diagnosis (Eggert et al. 2012). The loci implicated in these previous studies (SLK, BET1, TNRC6B, and FASN/CCDC57) have been replicated among EAs (Aissani et al. 2015a; Edwards et al. 2013b), which have been evaluated but failed to replicate in AAs (Aissani et al. 2015b; Wise et al. 2012). The predominant studies conducted among AA subjects have involved admixture mapping, which has shown significant regions of increased African ancestry in cases, particularly at 1q42.2 (Zhang et al. 2015), 4p16, and 10q26 (Wise et al. 2012). The aim of this work was to perform the first GWAS for fibroids risk among AA women using image-verified fibroids for discovery, with replication in cohorts that collected data on self-reported fibroids.

Methods

Study populations

Individuals with imaging-confirmed uterine fibroids and genome-wide genotype data were included from the following studies in Stage 1: Vanderbilt University BioVU, Mt Sinai, BioME and the Coronary Artery Risk Development in Young Adults (CARDIA) Women’s Study (CARDIA-WS) for a total of 1273 cases and 1379 controls. All studies received Institutional Review Board approval at their respective institutions and written informed consent was obtained for all participants.

BioVU is an electronic health record (EHR) biorepository at Vanderbilt University Medical Center, Nashville, TN and was designed to link clinical data available from de-identified EHRs to DNA specimens. Methods have been previously described (Pulley et al. 2010). BioME is the ongoing, consented EHR-linked biobank at the Institute for Personalized Medicine at the Icahn School of Medicine at Mt. Sinai. A subset of the available BioME samples were genotyped as part of the Electronic Medical Records and Genetics (eMERGE) Network and are referred to in this instance as Mt. Sinai, while genotyped samples acquired through other means were analyzed separately and will be henceforth referred to as BioME. The phenotyping algorithms used to identify cases and controls in BioVU, Mt. Sinai, and BioME have been previously published (Feingold-Link et al. 2014). Briefly, the phenotyping algorithm required at least one instance of pelvic imaging with a diagnosis code for fibroids to define cases among women aged 18 and over. For controls, at least two instances of pelvic imaging were required, on different dates, with no diagnosis or procedure codes indicating a fibroid, fibroid treatment, or hysterectomy. The CARDIA study is a prospective multi-center study with 5115 adult European ancestry and AA participants of the age group 18 to 30 years, recruited from four centers. Details of the CARDIA study design have been previously published (Friedman et al. 1988). The CARDIA Women’s Study performed standardized study ultrasounds on women from CARDIA to detect the presence or absence of fibroids, as previously described (Wellons et al. 2008).

Stage 2 data were provided by the direct-to-consumer genetic testing company 23andMe, which used a survey for determining case or control status. Specifically, cases were defined as females responding positively to either of the following questions: “Has a doctor ever told you that you have uterine fibroids?”, or “Have you ever been diagnosed with uterine fibroids?”. Controls were defined as females responding with a ‘no’ to either of the questions above. Age at enrollment with 23andMe was also collected (mean ages of 56 in cases and 47 in controls).

In addition, in Stage 2, GWAS samples were provided by the Black Women’s Health Study (BWHS), a U.S. prospective cohort study of premenopausal women (aged 22–50 years) in which uterine fibroid diagnoses were ascertained by self-report, with medical record validation among a subset of cases (Wise et al. 2004, 2005b). BWHS participants reported whether they had been diagnosed with fibroids on the follow-up questionnaires every 2 years (1997–2009). The women reported the calendar year of their first diagnosis, and whether the diagnosis was confirmed by ultrasound or surgery. Controls were premenopausal women who reported no diagnosis of fibroids through 2009. Among those who released their medical records, 96% of self-reported fibroid cases were confirmed (Wise et al. 2005b). The study workflow is diagrammed in Fig. 1.

Fig. 1
figure 1

Study workflow diagram

Genotyping

BioVU samples were genotyped using both the Affymetrix BioBank array and the Axiom World array 2 (Affymetrix Inc., Santa Clara, CA, USA) to attain better coverage for African-derived variants. BioME and the Mt. Sinai eMERGE site used the Illumina 1 M (Illumina Inc., San Diego, CA, USA) platform, with an exome chip backbone included for BioME samples. CARDIA-WS was genotyped on the Affymetrix 6.0 array (Affymetrix Inc., Santa Clara, CA, USA) as part of the Candidate-gene Association Resource which has been previously described in detail (Lettre et al. 2011).

The Stage 2 23andMe samples (1744 cases and 2906 controls) were genotyped on a custom GWAS panel across four versions. Stage 2 participants from BWHS (382 cases and 392 controls) were genotyped on the Illumina Infinium Expanded Multi-Ethnic Genotyping Array (MEGA; Illumina, Inc., San Diego, CA, USA) at Vanderbilt University VANTAGE Core Genotyping facility.

Quality control

Genotype quality control was performed within each study population, and a uniform protocol was implemented for all Stage 1 studies using PLINK (Purcell et al. 2007), including a 95% SNP and individual call rate threshold, removal of first-degree related individuals, sex checks, alignment of alleles to the genomic ‘+’ strand, and visualization of ancestry by principal components analysis using the Eigenstrat software (Price et al. 2006). No samples were excluded from analyses based on principal components.

For 23andMe in Stage 2, SNPs with Hardy–Weinberg equilibrium p value <10−20 were excluded, as were those with call rate <95%, or with large allele frequency discrepancies compared to European 1000 Genomes reference data. For BWHS in Stage 2, genotype quality control was also performed in PLINK and consisted of 95% SNP and 90% individual call rate thresholds, removal of first-degree related individuals, sex checks, alignment of alleles to the genomic ‘+’ strand, and visualization of ancestry by principal components analysis using the Eigenstrat software (Price et al. 2006). No samples were excluded from analyses based on principal components.

Statistical analysis

All demographic data were summarized and evaluated with Stata 14.0 (College Station, TX, USA).

The Stage 1 data were imputed to the 1000 Genomes phase 3 reference panel using SHAPEIT2 (Delaneau et al. 2012) for haplotype phasing and IMPUTE2 (Duan et al. 2013) for genotype imputation, with phasing and imputation performed separately for each data set. SNPTEST2 (Marchini et al. 2007) was used to perform single-variant association analyses for each cohort independently. Analyses were adjusted for the first five principal components in each sample to account for population substructure (Supplemental Fig. 1). Additional analysis adjusting for age was also performed, as was an analysis to avoid misclassification errors by restricting controls to peri-menopausal women (age ≥45) to remove women who were at increased risk of developing fibroids.

The Stage 2 data from 23andMe were imputed to the March 2012 “v3” release of 1000 Genomes reference haplotypes. Phasing and imputation were performed for each genotyping platform version separately, using Beagle4 (version 3.3.1) to phase batches of 8000–9000 individuals across chromosomal segments of no more than 10,000 genotyped single nucleotide polymorphisms (SNPs), with overlaps of 200 SNPs, and a high-performance version of Minimac5 for imputation of each phased segment against all-ethnicity 1000 Genomes haplotypes, using five rounds and 200 states for parameter estimation. Logistic regression was modeled using an additive genetic model adjusting for age and five principal components, reporting likelihood ratio test p values for association with genotypes.

Stage 2 also utilized 776 samples from BWHS which were imputed to the 1000 Genomes phase 3 reference panel using SHAPEIT2 (Delaneau et al. 2012) for haplotype phasing and IMPUTE2 (Duan et al. 2013) for genotype imputation, with phasing and imputation performed separately for each data set. SNPTEST2 (Marchini et al. 2007) was used to perform single-variant association analyses. Association analysis was performed using logistic regression of the imputed genotype data. Analyses were adjusted for ten principal components.

All association results for Stage 1 and across Stages 1–2 were meta-analyzed using fixed effects inverse-variance weighted meta-analysis in METAL (Willer et al. 2010). The final meta-analysis of all variants including all stages consisted of 3399 cases and 4764 controls.

To further evaluate the genetic association results in the context of gene expression, we employed the novel method MetaXcan (Barbeira et al. 2016), an extension of the PrediXcan method (Gamazon et al. 2015). PrediXcan conducts a test of association between phenotypes and gene expression levels predicted by genetic variants in a library of tissues from the Genotype-Tissue Expression (GTEx) project (2015; Carithers et al. 2015; Mele et al. 2015). MetaXcan is a meta-analysis approach that conducts the PrediXcan test using genotype association summary statistics, rather than performing the tests in individual-level data. For the purposes of this study, we utilized covariance matrices built for nine relevant tissues from GTEx (i.e., uterus, ovary, breast, vagina, subcutaneous adipose, visceral omentum adipose, thyroid, whole blood, and transformed fibroblasts) to annotate SNP association signals as well as to provide information about likely tissue expression patterns and relevant biological information.

Results

We used our previously validated EHR algorithm to define 999 image-confirmed cases and 1233 image-confirmed controls for fibroids from BioVU and Mt. Sinai (including the Mt. Sinai eMERGE subset as well as other cases and controls from the remainder of the BioME resource) sites. We also included additional samples from the CARDIA Women’s Study, which incorporated a standardized research ultrasound, for a total of 1273 case and 1379 control samples for analysis in Stage 1 discovery analyses (Fig. 1). Descriptive characteristics of each study are presented in Table 1. Briefly, cases and controls were of similar ages within each individual study in Stage 1 (no more than 4 year difference in means), though the Mt. Sinai sample was substantially older on average than the other three studies. However, comparing ages across studies is challenging due to the differing study designs (case–control vs. cohort) and incident periods.

Table 1 Demographic characteristics of study populations

All data sets were genotyped on study-specific genome-wide genotyping arrays, imputed to the 1000 Genomes reference panel, and analyzed for association with fibroids, adjusting for principal components of ancestry (Supplemental Table 1). Meta-analysis of both Stages (3399 cases and 4764 controls) revealed a single genome-wide significant locus near CYTH4 (cytohesin 4) on chromosome 22 (Table 2; bottom panel of Fig. 2). The lead SNP at this locus, rs739187 [p value = 7.82 × 10−9, OR [95% confidence interval (CI)] = 1.23 (1.16–1.30)] had low heterogeneity across cohorts.

Table 2 Summary of genome-wide significant and suggestive (<5 × 10−7) associations in GWAS meta-analysis
Fig. 2
figure 2

Top panel presents (as a mirror of the association results) the association with predicted gene expression levels from relevant tissues from GTEx, also arranged by chromosome and position; bottom panel presents SNP association results from meta-analysis across both stages in an inverted Manhattan plot format, arranged by chromosome and position

To further evaluate the genetic association results in the context of gene expression, we employed the novel method MetaXcan which conducts a test of association between phenotypes and gene expression levels predicted by genetic variants in a library of tissues from the GTEx project (GTEx Consortium et al. 2015; Carithers et al. 2015; Mele et al. 2015). Analysis with MetaXcan to evaluate association with genetically predicted gene expression (GPGE) levels resulted in 1542 results across 9 GTEx tissues with p value <0.05, from a total of 44,577 comparisons (Table 3; upper panel of Fig. 2). Examination of the locus implicated by the most significant genetic result, CYTH4, revealed a significant association with reduced GPGE at that gene in thyroid tissue (p value = 5.86 × 10−8). A striking result from this analysis was association between zinc finger protein 391 (ZNF391) GPGE in eight of the nine tissues and uterine fibroid risk (smallest p value = 5.05 × 10−5 in breast tissue).

Table 3 Genetically predicted gene expression results from GWAS summary statistics

Discussion

In this first multi-stage GWAS of uterine fibroids among AA women, we found a genome-wide significant result with linkage disequilibrium support on chromosome 22 near the CYTH4 gene (Fig. 3). Not a great deal is known about this gene, although variants near this locus have been previously suggestively associated with methotrexate response in juvenile idiopathic arthritis (Cobb et al. 2014), and a primate-specific repeat in the promoter of CYTH4 has been linked to bipolar disorder (Rezazadeh et al. 2015). CYTH4 is most highly expressed in whole blood and spleen based on observations in GTEx and only expressed at low levels in all other measured tissues. CYTH4 expression has also been observed to be low in myometrial tissue postpartum (Kanamarlapudi et al. 2012). The top signal from the GPGE analysis also identified predicted reduced CYTH4 in thyroid tissue as associated with fibroid risk, supporting a potential biological role for this gene. Women with fibroids have previously been shown to often have concurrent thyroid conditions, including overt hypothyroidism (Ott et al. 2014), thyroid nodules (Kim et al. 2010; Spinos et al. 2007), and thyroid cancer (Braganza et al. 2014; Makaridze and Mardaleishvili 2011). In addition, expression of genes related to fibroids (HMGA2 and PLAG1) was found to be correlated between fibroid and thyroid tumors (Klemke et al. 2014), which may provide an explanation for the observations in this study.

We also observed suggestive evidence of association (p ≤ 5 × 10−7) at a variant between LY6D (lymphocyte antigen 6 family member D) and JPH1 (junctophilin 1). The lead SNP in this region, rs6472827, is located 25 kb upstream of the coding region for JPH1 on chromosome 8, and is a cis-eQTL in both sun exposed and non-sun exposed skin tissue from GTEx. However, the GPGE results implicated decreased LY6D expression in breast tissue as associated with fibroid risk in the region (p value = 1.52 × 10−4). Increased expression of this gene has been observed in a variety of cancer types, including breast, compared to normal non-cancerous counterpart tissues (Luo et al. 2016).

Evaluation of the genetic association results in the context of GPGE across a variety of tissues provided supportive evidence for the biological mechanism underlying the genetic data. Perhaps most striking among GPGE results was ZNF391, which encodes zinc finger protein 391, which was suggestive in eight of the nine tissues (and nominal in uterus, p value = 0.03). This gene has previously been implicated in rheumatoid arthritis (Orozco et al. 2011), and more interestingly, is expressed in several reproductive-related tissues in GTEx. The highest expression of ZNF391 is in testis and ovary, with uterine expression falling in the top third of tissues. Furthermore, a SNP in this gene (rs4713108) is suggestively associated with expression of ZNF391 in uterine tissue (p value = 1.1 × 10−7). This particular SNP was not associated with fibroid risk in our meta-analysis (p value = 0.08); however, there was a broad base of suggestive SNPs in the region with p values as low as 2.01 × 10−6 (Supplemental Fig. 2).

Another gene target identified through GPGE, ALDH2 (p value = 7.30 × 10−5), encodes a mitochondrial isoform of alcohol dehydrogenase, which is involved in alcohol metabolism. Five studies have shown a consistent relationship between increased alcohol consumption and fibroid risk (OR = 1.25–2.78) (D’Aloisio and Baird 2004; Marshall et al. 1997; Nagata et al. 2009; Templeman et al. 2009; Wise et al. 2004). Predicted decreased levels of AGT in thyroid tissue were also implicated as being associated with fibroid risk (p value = 2.49 × 10−5). AGT encodes angiotensinogen, which is involved in renal homeostasis, and given the observed link between hypertension and fibroids, may support a plausible biological connection between the two.

Previous GWAS for fibroids in Japanese subjects has implicated variants near the BET1L, TNRC6B, and OBFC1SLK genes (Cha et al. 2011). We did not observe suggestive associations in our AA subjects at the prior GWAS-associated SNPs near BET1L and OBFC1-SLK which were reported by Cha et al. (Supplemental Figs. 3–8). The best result at these loci was near TNRC6B (rs59426214; p value = 2.42 × 10−5; Supplemental Fig. 3), although this variant was not in LD with the index SNP (rs12484776) detected by Cha et al. This lack of strong association is largely consistent with prior published studies that have attempted to replicate the associations within cohorts of African ancestry.

A potential consideration of this study was that the Stage 2 studies relied upon self-reported fibroids information, which is known to be subject to misclassification bias. However, the top result at CYTH4 was suggestive in the imaged Stage 1 analysis with a p value of 6.46 × 10−7, and effect sizes were consistent between Stages 1 and 2. Additional points to consider are that the control group was not restricted to women beyond menopause, when they would have a negligible likelihood of developing new fibroids. We did explore this as a secondary analysis in the studies from Stage 1, albeit with reduced power due to the smaller sample size (data not shown). Restricting the controls to women peri-menopausal and younger (age <45) did not change the overall associations, as many of the upper tier signals were of similar effect size (r 2 = 0.87) despite the lack of significance due to reduced power. In addition, adjustment for age in Stage 1 did not markedly alter the results seen in the primary analysis, with several of the top overall signals remaining suggestively significant (6/12 variants with p value <1e−4) and having consistent effect sizes (r 2 = 0.96). Further considerations are the small numbers and heterogeneity among tissues included in the GTEx models. Uterus in particular had a very small number of tissue samples (N = 70), and is known to be a heterogeneous mix of specific cell/tissue types.

Fig. 3
figure 3

Regional association plot for CYTH4. There is substantial LD between the lead SNP rs739187 and other SNPs in the region

In conclusion, in this first multi-stage GWAS of uterine fibroids among AA women, we identified a novel genome-wide significant result in CYTH4 with biological support from the significant GPGE showing reduced CYTH4 expression in thyroid tissue. Further studies are needed to confirm the SNP and target loci and elucidate the biological role of this gene and others identified through GWAS and GPGE in relation with fibroid development (Edwards et al. 2013a). Overall, we have identified novel genetic signals, highlighting the importance of this work for understanding fibroids in AA women.