Background

Stroke has the largest racial disparity of any chronic disease with a striking disparity in the burden of stroke among individuals of African ancestry compared to other populations [1,2,3,4,5]. However, the genetic architecture of stroke in indigenous African populations is largely unknown [1, 6, 7]. Previous genome-wide association studies (GWAS) have identified important genetic variants associated with stroke risk in European and Asian ancestry populations, with sparse inclusion of African-American populations [8,9,10,11] (who have up to 80% African genetic admixture) [12, 13]. Despite this progress, the stroke genetic landscape remains incomplete. It is imperative to explore indigenous African populations because of the higher stroke heritability in African ancestry populations [14, 15]. The increased diversity of the African genome [16, 17] also improves the potential for making novel discoveries [18]. Moreover, the inclusion of African ancestry populations is vital to trans-ancestry meta-analysis with implications for fine-mapping of known stroke-associated loci, uncovering of novel loci, characterization of causal variants, design of polygenic risk scores, development of new targeted therapies, and personalized interventions for stroke in Africans and other global populations.

In a GWAS meta-analysis of stroke in > 22,000 individuals of African ancestry undertaken by the Consortium of Minority Population GWAS of Stroke (COMPASS) (3734 physician-adjudicated stroke patients and 18,317 individuals with no history of stroke), one single-nucleotide polymorphism (SNP rs55931441) near the HNF1A gene attained genome-wide significance, while variants in 24 additional unique loci, including the SFXN4 and TMEM108 genes, demonstrated suggestive associations [8]. In the most recent GIGASTROKE project, which involved cross-ancestry GWAS meta-analyses of stroke and its subtypes in 110,182 stroke patients (33% non-European) and 1,503,898 control individuals from five ancestries, association signals were detected at 89 independent loci, and effect sizes were correlated across ancestries, demonstrating consistent directionality even when significance was not attained. New drug targets were discovered [19]. However, no variants were described for indigenous African populations.

The Stroke Investigative Research and Education Network (SIREN) is the largest epidemiological study on stroke among indigenous Africans with dual goals of characterizing the dominant modifiable vascular risk factors [20] and unraveling potential unique genetic variants associated with stroke occurrence among West Africans. Herein, we report the findings of the first stroke GWAS performed in an indigenous African population of 3434 subjects (1691 ischemic stroke cases and 1743 stroke-free controls) from the SIREN Study. The report also includes an African ancestry meta-analysis combining summary statistics from the COMPASS Consortium (n > 22,000; 3734 cases, 18,317 controls) [8, 9] and a trans-ancestry meta-analysis with summary datasets from the MEGASTROKE [10] (521,612 individuals: 67,162 cases and 454,450 controls). We fine-mapped identified GWAS loci using PAINTOR. To understand the functional relevance of putative genes, we functionally annotated potential causal variants through the Cerebrovascular Disease Knowledge Portal [21] (https://cd.hugeamp.org/), the GTEx Portal (https://www.gtexportal.org), and chromatin interaction and eQTL analysis using Functional Mapping and Annotation of Genome-Wide Association Studies (FUMA) [22, 23]. Additionally, we used the University of California, Santa Cruz (UCSC) browser to confirm the potential chromatin interactions in putative genes.

Methods

Patient enrollment and data acquisition

The rationale and design of the SIREN study have been described elsewhere [24]. In brief, the SIREN study was initiated in August 2014 as a multi-center case-control study with 16 sites in Nigeria and Ghana. The ethnographic characteristics of the study population are as previously described [25]. Ethical approval was obtained for all study sites, and informed consent was obtained from all subjects. Cases were consecutively recruited consenting adults (aged 18 years or older) with first clinical stroke within 8 days of current symptom onset or “last seen without a deficit” with confirmatory cranial CT or MRI scan performed within 10 days of symptom onset. Stroke-free controls were also recruited, and their status ascertained with a locally validated version of the Questionnaire for Verifying Stroke-Free Status (QVSFS) [26].

Relevant data were collected, including basic demographic and lifestyle data (ethnicity, native language of the subjects and their parents, socioeconomic status, dietary patterns, routine physical activity, stress, depression, cigarette smoking, and alcohol use). Cardiovascular and anthropometric measurements were obtained using standard techniques, and neurologic assessment was carried out to assess neurologic deficits and ascertain stroke severity using the National Institute of Health Stroke Severity Score. Blood samples were collected from all subjects at baseline for determination of parameters including fasting lipid profile, blood glucose, and HbA1c. Stroke diagnosis and phenotyping were undertaken as previously described [20]. Stroke etiology (large vessel, small vessel, cardioembolic, and undetermined) was determined using the Trial of Org 10172 in Acute Stroke Treatment (TOAST) criteria (single dominant causative classification) via a rigorous process of investigative evaluation including neuroimaging (CT/MRI), 12-lead electrocardiography, echocardiography, and carotid Doppler ultrasonography as previously described [20, 24].

Description of risk factors

Hypertension was defined as sustained systolic BP > 140 mmHg or diastolic BP > 90 mmHg after the onset of stroke, a history of hypertension, or taking antihypertensive medications before the stroke [20]. Diabetes mellitus was defined based on a previous history of diabetes mellitus, use of medications for diabetes mellitus, fasting glucose levels > 126 mg/dl, and/or HbA1c > 6.5% [20]. Dyslipidemia was defined following the recommendations of the US National Cholesterol Education Program as a high fasting serum total cholesterol > 200 mg/dl or high-density lipoprotein (HDL) < 40 mg/dl [6] or low-density lipoprotein (LDL) > 130 mg/dl or triglyceride (Trig) ≥ 150 mg/dl or a history of statin use before the stroke. Cardiac disease was defined as a history or current diagnosis of atrial fibrillation, cardiomyopathy, heart failure, ischemic heart disease, or rheumatic heart disease. Obesity was assessed by defining central adiposity using the waist-hip ratio. A waist-hip ratio of ≥ 0.90 (men) or ≥ 0.85 (women) was reported as Yes, while values below these cutoffs were reported as No [6, 20, 24].
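The risk factor definitions above can be read as simple threshold rules. The following is a minimal sketch, not the study's actual code: the function and parameter names are hypothetical, and the requirement that elevated blood pressure be "sustained" is not modeled.

```python
def has_hypertension(systolic_bp, diastolic_bp, history=False, on_medication=False):
    """Blood pressure in mmHg; cutoffs as stated in the text (> 140 or > 90)."""
    return systolic_bp > 140 or diastolic_bp > 90 or history or on_medication


def has_diabetes(fasting_glucose_mg_dl, hba1c_percent, history=False, on_medication=False):
    """Fasting glucose > 126 mg/dl and/or HbA1c > 6.5%, or prior history/medication."""
    return fasting_glucose_mg_dl > 126 or hba1c_percent > 6.5 or history or on_medication


def has_central_obesity(waist_hip_ratio, is_male):
    """Waist-hip ratio >= 0.90 (men) or >= 0.85 (women) is reported as Yes."""
    cutoff = 0.90 if is_male else 0.85
    return waist_hip_ratio >= cutoff
```

Note that the blood pressure and glucose cutoffs are strict inequalities, whereas the waist-hip ratio cutoffs are inclusive, mirroring the wording of the definitions.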

Genotyping and imputation

The samples included in this study were genotyped using Illumina’s H3Africa microarray chip. Using Illumina’s GenomeStudio software and its data management plugins, the raw genotype data were converted into PLINK-formatted datasets for downstream quality control (QC) and statistical analysis. Sample QC excluded (a) individuals with sex discordance between reported sex and sex inferred from genetic data, (b) cases with hemorrhagic stroke, (c) duplicate sample pairs, identified by > 90% concordance in genotype data, (d) mixed-up samples based on genotypic concordance between samples, and (e) outlier samples identified through estimation of genetic principal components. To address potential population stratification, we performed principal component (PC) analysis using EIGENSTRAT’s Smartpca module [27, 28]. We also excluded participants whose phenotypic and genetic data did not pass quality control or who had missing values in any covariate.

There were 2,221,421 raw variants processed through a series of in-house QC steps, including (a) retention of autosomal SNPs only, (b) removal of ambiguous SNPs (A/T and C/G), (c) removal of non-biallelic variants (e.g., indels, SNPs without a valid alternative allele in the bim file for example “0/T”), and (d) handling strand inconsistencies. Furthermore, SNPs were removed for violation of Hardy-Weinberg equilibrium P < 1.0E−05, minor allele frequency (MAF) < 1%, and/or a missing rate > 10%. After implementing these steps, 1,815,856 genotyped variants were included for imputation. In addition to the above-mentioned QC metrics, McCarthy Group Tools (https://www.well.ox.ac.uk/~wrayner/tools/) was employed to handle strand inconsistencies, ref/alt allele assignment, removal of SNPs not in reference panel, and filtering out SNPs with out-of-bound differences in the minor allele frequency (MAF) when compared with 1000Genomes African-Americans (i.e., SNPs with > 0.2 allele frequency difference between the SIREN cohort and 1000 genomes). Allele frequency and allele assignment fixes in McCarthy tools were performed based on the population-specific reference panels to ensure the African cohort of the SIREN study was compared with its corresponding sub-population cohort of the 1000 genomes.
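The pre-imputation variant filters described above can be summarized as one pass/fail check per variant. This is an illustrative sketch only; the function name and argument layout are hypothetical, and the study applied these steps with PLINK and in-house scripts rather than with this code.

```python
AUTOSOMES = {str(i) for i in range(1, 23)}
VALID_BASES = set("ACGT")
# Strand-ambiguous allele pairs (A/T and C/G) cannot be resolved by flipping.
AMBIGUOUS_PAIRS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}


def passes_variant_qc(chrom, ref, alt, maf, missing_rate, hwe_p):
    """Return True if a variant survives the filters listed in the text."""
    if chrom not in AUTOSOMES:
        return False  # autosomal SNPs only
    if ref not in VALID_BASES or alt not in VALID_BASES:
        return False  # drops indels and invalid alleles such as "0/T"
    if (ref, alt) in AMBIGUOUS_PAIRS:
        return False  # A/T and C/G SNPs
    if hwe_p < 1.0e-5:
        return False  # Hardy-Weinberg equilibrium violation
    if maf < 0.01:
        return False  # minor allele frequency below 1%
    if missing_rate > 0.10:
        return False  # missing rate above 10%
    return True
```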

Having a well-curated, high-quality reference panel is key to discovering true biological signals and minimizing false positives or negatives in genome-wide association studies. We used the TOPMed release2 reference panel from the BioData Catalyst (https://imputation.biodatacatalyst.nhlbi.nih.gov/#!) for imputing the genotypes. The TOPMed release2 panel comprised 97,256 samples and 308,107,085 genetic variants distributed across the 22 autosomes and the X chromosome, inferred from a jointly called variant set derived from whole-genome sequencing of TOPMed samples. The TOPMed Imputation Server was configured to (a) use TOPMed as the reference panel, (b) retain variants with an imputation quality filter (R2) > 0.3, (c) employ Eagle v2.4 [29] for phasing, (d) conduct a QC frequency check before imputation, and (e) run in Quality Control and Imputation mode to output QC statistics along with imputed dosage and info datasets. Upon completion of imputation to the TOPMed R2 (Freeze8) panel, variants were retained if (a) the imputation quality (R2) was > 0.3 and (b) the minor allele count (MAC) was > 20. Variants with imputed genotype probabilities < 0.9 were masked as missing to ensure high-quality calls.

Before exploring the association between imputed SNPs and predictors of interest, imputed variants were further quality controlled for genotypic characteristics. SNPs were retained for association analysis only when they met the criteria of (a) attaining a Minor-Allele Frequency > 1%; (b) being SNPs only, not indels (which were removed); and (c) having an Imputation quality, R2 > 0.3. Although imputation quality is a composite score that would aggregate individual genotype quality across all samples and issue a variant level metric, to foster high-quality genotype calls, we examined the genotype probabilities (GP) associated with each genotype call and masked the genotype calls to missing if the probability of the inferred call was < 90%. All post-imputation quality control steps were conducted using PLINK 1.9 [30] and VCFTOOLS 0.19 (https://vcftools.sourceforge.net/man_latest.html). After imputation, a total of 50,877,079 variants were processed through a quality-control pipeline to yield a final count of 44,159,966 variants (R2≥0.3) for statistical association tests in PLINK1.9. Of the 44,159,966 SNPs used for downstream association analysis, 77% of imputed variants had R2≥0.8, and 91% had R2≥0.5.
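The genotype-probability masking step described above can be illustrated as follows. This is a minimal sketch assuming each imputed call carries a triple of probabilities for 0, 1, and 2 copies of the alternate allele; the function name and data layout are hypothetical, and the study performed this step with PLINK and VCFtools.

```python
def mask_low_confidence(gp, threshold=0.9):
    """Return the most probable genotype (0, 1, or 2), or None if masked.

    gp is a sequence (P(0 copies), P(1 copy), P(2 copies)). A call is
    set to missing when its probability is below the threshold (90%).
    """
    best = max(range(3), key=lambda g: gp[g])
    return best if gp[best] >= threshold else None
```

For example, `mask_low_confidence((0.95, 0.04, 0.01))` keeps genotype 0, whereas `mask_low_confidence((0.60, 0.35, 0.05))` is masked because no genotype reaches 90% probability.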

Association methods and analyzed models

Statistical association analysis was conducted using PLINK 1.9. To test for associations between ischemic stroke status and variant SNPs, we fitted a logistic regression model where SNP was modeled as a predictor variable whose values were equal to the number of copies of the minor allele (0, 1, 2) (i.e., additive mode of inheritance). In all association analyses, we used the first 10 principal components (PCs) as covariates to control for ancestry. Our primary model (model 0) for association included sex, age, 10 PCs, and SNP as a covariate in logistic regression. For sensitivity analyses, the stroke risk factors were added to the base model in nested regression models hierarchically to ensure the significant SNPs found in the base model are associated with stroke and are not mediated by risk factors. The sensitivity analysis models are given below:

  • Model 1: stroke status ~ sex + age + PCs1 … 10 + SNP + hypertension

  • Model 2: stroke status ~ sex + age + PCs1 … 10 + SNP + hypertension + diabetes

  • Model 3: stroke status ~ sex + age + PCs1 … 10 + SNP + hypertension + diabetes + dyslipidemia

  • Model 4: stroke status ~ sex + age + PCs1 … 10 + SNP + hypertension + diabetes + dyslipidemia + cardiac disease status

  • Model 5: stroke status ~ sex + age + PCs1 … 10 + SNP + hypertension + diabetes + dyslipidemia + cardiac disease status + waist-hip ratio
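The nesting of the base model and the five sensitivity models can be made explicit by building each formula from a shared list of terms. This is a sketch for illustration; the term names (for example `cardiac_disease`) are hypothetical labels, not the covariate names used in the study's PLINK runs.

```python
BASE_TERMS = ["sex", "age"] + [f"PC{i}" for i in range(1, 11)] + ["SNP"]
# Risk factors in the order they are added hierarchically (models 1 to 5).
RISK_FACTORS = ["hypertension", "diabetes", "dyslipidemia",
                "cardiac_disease", "waist_hip_ratio"]


def model_terms(k):
    """Predictors for model k, where k = 0 is the base model."""
    if not 0 <= k <= len(RISK_FACTORS):
        raise ValueError("model index must be between 0 and 5")
    return BASE_TERMS + RISK_FACTORS[:k]


def model_formula(k):
    """Formula string for model k in the notation used above."""
    return "stroke_status ~ " + " + ".join(model_terms(k))


for k in range(6):
    print(f"Model {k}: {model_formula(k)}")
```

Because each model's terms are a prefix of the next, a SNP effect that persists from model 0 to model 5 is not explained by any of the added risk factors.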

Other cohorts

The COMPASS and MEGASTROKE were also involved in the analysis. The constituent studies of both COMPASS and MEGASTROKE are described in Additional file 2: Other Study Cohorts.

Meta-analysis

We meta-analyzed association test results using the random-effects model of Han and Eskin implemented in METASOFT [31] with SIREN and COMPASS data sets. Lastly, we used Meta-Analysis of TRansethnic Association studies (MANTRA) [32] software to perform meta-analysis using SIREN (a West-African study), COMPASS (an African-American study), and MEGASTROKE (a European study). There are several advantages of using METASOFT, namely, (1) it provides fixed effects model (FE) based on inverse-variance-weighted effect size similar to METAL [33], (2) conventional random effects model (RE) based on inverse-variance-weighted effect size, (3) Han and Eskin’s random effects model (RE2) optimized to detect associations under heterogeneity, and (4) binary effects model (BE) optimized to detect associations when some studies have an effect and some do not have any effect.
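To make the inverse-variance weighting concrete, the fixed-effects (FE) estimate mentioned above can be sketched in a few lines. This illustrates only the FE model, not Han and Eskin's RE2 or the BE model, and it is not METASOFT's implementation; the function name is hypothetical.

```python
import math


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effects meta-analysis of per-study log odds ratios.

    Each study is weighted by 1 / SE^2. Returns the pooled effect,
    its standard error, the z statistic, and a two-sided P-value.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return beta, se, z, p
```

With two studies of equal precision, the pooled effect is the simple average and the standard error shrinks by a factor of √2, which is why a much larger study can dominate the pooled estimate.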

Fine-mapping

In our fine-mapping analysis, we used the PAINTOR [34] software package to discover potential causal variants. Fine-mapping regions were initially defined using a window (~50 kb) around the most significant variant. However, given the distribution of intergenic variants with association P-value < 1.0E−4, we expanded to a wider window where variants’ linkage disequilibrium with the lead variant extended outside the initial window. This was achieved by manual inspection of regional association plots to ensure the most relevant region was adequately captured.

To determine top tissue-based annotation sets for each region, we used the approach showcased in the PAINTORv3 fine-mapping software distributed through the GitHub repository. To determine the annotation relevant to stroke, we ran PAINTOR on each annotation independently. The sum of the log-Bayes factors (BFs) and effect size estimates for each annotation is further converted to relative probability for an SNP to be causal in a certain annotation track. To test the significance of annotation, the sum of the log-Bayes factors with only baseline annotation was compared with both baseline and the annotation of interest. The significance of the enrichment was further calculated from a standard ratio test comparing null (baseline annotation) and alternate (both baseline and annotation of interest) modes. By the likelihood ratio test (LRT) approach of testing each annotation, we selected the top 10 annotations to calculate the posterior probability of each SNP within our sliding window containing top GWAS SNPs.
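The comparison of the baseline-only model with the baseline-plus-annotation model can be sketched as a likelihood ratio test. This sketch assumes natural-log likelihoods and one additional parameter per annotation (one degree of freedom), which is an assumption rather than something stated in the text; the function name is hypothetical.

```python
import math


def likelihood_ratio_test(loglik_null, loglik_alt):
    """LRT for one extra parameter (chi-square with 1 degree of freedom).

    loglik_null: log-likelihood with baseline annotation only.
    loglik_alt: log-likelihood with baseline plus the annotation of interest.
    """
    statistic = max(2.0 * (loglik_alt - loglik_null), 0.0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

Annotations can then be ranked by P-value (or equivalently by the test statistic) to select the top annotations for the final fine-mapping run.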

Functional stratum of significant hits

The working set of top SNPs from our association analysis was further annotated using ANNOVAR to determine both gene and SNP level function. dbSNP151 data release from UCSC was employed to assign rs# naming conventions to our variants reported in the additional file results dataset. To address discrepancies in the genome geography between human genome builds hg19 and hg38, functional annotations for both hg19 and hg38 are catalogued in all additional file tables. Since the traditional annotation assignment is based on just the genomic transcription coordinates of a gene, an additional 50 kb flanking distance was allowed for top SNPs to finalize the gene assignment to association analysis top SNPs. An arbitrary flanking distance of 50 kb around the transcription start and end positions allows reporting SNPs with significant association with ischemic stroke that could circumscribe broader biochemical signatures typically associated with non-coding functional elements like gene promoters, upstream enhancers, regulators, insulators, and TFBS (transcription factor binding sites).

Functional mapping and annotation (FUMA)

FUMA [22] is an online platform for the functional mapping of genetic variants. FUMA performs functional annotation of GWAS results, prioritization of potential causal genetic variants and genes, and interactive visualization by biological data repositories and tools. FUMA contains two core functions to annotate input summary statistics (both SNPs and genes) to prioritize potential causal genetic variants and genes: SNP2GENE and GENE2FUNC modules. In the SNP2GENE module, SNPs are annotated with their biological function and mapped to genes based on positional and functional information of SNPs. Functionally annotated SNPs are mapped to genes based on functional consequences on genes (positional mapping), expression quantitative trait loci (eQTLs), and chromatin interactions of phenotype relevant tissue types. FUMA utilizes three strategies. First is positional mapping based on the physical distances (within a 10-kb window) from known protein coding genes in the human reference assembly (GRCh37 or hg19). Second is eQTL mapping with capturing information from three data repositories (GTEx, Blood eQTL browser, and BIOS QTL browser) and mapping SNPs to genes based on a significant eQTL association. It should be noted that eQTL mapping is based on cis-eQTLs (local regulatory effect within 1 Mb). A false discovery rate (FDR) of 0.05 is used to define significant eQTL association. Third is chromatin interaction mapping, involving mapping of SNPs to the promoter regions of genes based on significant chromatin interactions. FUMA selects chromatin interactions for which one region involved in the interaction overlapped with predicted enhancers and the other overlapped with predicted promoters 250 bp upstream and 500 bp downstream of the transcription start site (TSS) of a gene. By combining these three mapping strategies, FUMA prioritizes genes that are most likely to be involved in the trait of interest such as ischemic stroke. 
To obtain insight into putative causal mechanisms, the GENE2FUNC process annotates the prioritized genes in biological context, such as tissue specific gene expression pattern, and enrichment of gene sets.

Gene set analysis

Genes implicated by mapping of GWAS SNPs were further investigated using the GENE2FUNC procedure in FUMA, which provides hypergeometric tests of enrichment of the list of mapped genes in MSigDB gene sets, including BioCarta, KEGG, Reactome, and Gene Ontology (GO). The adjusted P-value (FDR) for gene set enrichment analysis was computed using the Benjamini-Hochberg procedure. We used an adjusted P-value threshold of 0.05 and required a minimum of two input genes overlapping with a tested gene set. The UCSC Genome Browser on the Human Feb. 2009 (GRCh37/hg19) Assembly was used to render the omics landscape around the significant SNP regions.
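The two statistical ingredients of this enrichment analysis, the hypergeometric test and the Benjamini-Hochberg adjustment, can be sketched with the standard library alone. This is an illustration of the general technique, not FUMA's code; the function names and the example P-values are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(background, in_set, n_input, overlap):
    """P(X >= overlap) when n_input genes are drawn from a background of
    `background` genes, of which `in_set` belong to the tested gene set."""
    upper = min(in_set, n_input)
    numerator = sum(comb(in_set, i) * comb(background - in_set, n_input - i)
                    for i in range(overlap, upper + 1))
    return numerator / comb(background, n_input)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A gene set would then be reported when its adjusted P-value is below 0.05 and at least two input genes overlap it.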

Results

Characteristics of the study sample

To ensure retention of high-quality samples relevant to our research study, we followed strict protocol to retain only samples that met our quality thresholds (detailed descriptions of quality control procedures are provided in the “Methods” section). We retained 1683 ischemic stroke cases and 1738 stroke-free controls with a sex-stratified distribution of 1830 males and 1591 females after the application of stringent QC criteria. The demographic and risk factor characteristics by case-control status are described in Table 1. The mean age of the subjects with ischemic stroke was 61.2 (± 13.7) years, while the mean age of stroke-free control subjects was 59.5 (± 13.5) years (P-value = 0.0005). Consistent with previous observations, we demonstrated an abnormal waist-hip ratio as a strong risk factor for stroke (P-value < 0.0001). Cases were significantly more likely than controls to have a history of hypertension (95% vs. 63%) (P-value < 0.0001), diabetes (36% vs 14%) (P-value < 0.0001), dyslipidemia (73% vs. 61%) (P-value < 0.0001), and cardiac disease (13% vs. 6%) (P-value < 0.0001); we did not observe significant differences between cases and controls with respect to sex (P-value = 0.7062). We investigated clustering of potential ethnic differences in comparison with other 1000G populations using principal component analysis (PCA). The SIREN samples clustered together with 1000G African samples (Additional File 4: Fig. S1).

Table 1 Characteristics of the SIREN case-control samples after QC

Discovery genetic association analysis

Manhattan plots for all six models are depicted in Fig. 1, starting with the primary/base model adjusted for sex, age, 10 PCs, and SNP. The base model was then extended hierarchically by adding one risk factor at a time: hypertension, diabetes, dyslipidemia, cardiac status, and waist-hip ratio. The quantile-quantile (QQ) plots are shown in Additional File 4: Fig. S2. We used the method proposed by Li and Ji, based on spectral decomposition, to estimate the effective number of SNPs (i.e., the number of independent SNPs) using 1,575,904 SNPs (MAF ≥ 0.01) (14, 15). We found that the number of independent SNPs is ~987,177, which is close to 1M. We used a significance level of 5.06E−08 (= 0.05/987177) to correct for multiple testing. In Additional File 1: Table S1, we provide the ischemic stroke association results for all SNPs with P-value < 1.0E−6 in any of the six models. Thirty-two loci in chromosomes 2, 3, 5, 6, 7, 12, and 13 attained significance (P-value < 1.0E−6) in at least one of the six models. Note that only 7 of these SNPs were independent; the remaining SNPs are listed to show that the 7 independent SNPs had good linkage disequilibrium support (Additional File 1: Table S1). We observed genome-wide significant SNP associations near the AADACL2 gene (distance ~50 kb) in chromosome 3 with the inclusion of hypertension in the base model [rs6440776, odds ratio (OR) of 0.73 with 95% CI: 0.66-0.82, P-value = 3.71E−08] (Table 2, Additional File 1: Table S1). When diabetes was added to the model in addition to hypertension, rs6440776 remained genome-wide significant. When dyslipidemia was further added to the model with hypertension and diabetes, rs6440776 narrowly missed the genome-wide significance threshold (OR 0.73 with 95% CI 0.66-0.82, P-value = 5.59E−08). Note that after adding cardiac status and waist-hip ratio to the model, both SNPs remained significant at P-value < 1.0E−06 (Table 2). 
Furthermore, a similar association pattern was observed in SNPs near the MIR4458HG gene (distance ~33 kb) in chromosome 5 with marginal significance (P-value < 1.0E−05) in all models (Table 2, Additional File 1: Table S1). Additional File 1: Table S2 contains the association results for any SNPs with a P-value < 1.0E−04. The locus zoom plots for SNPs in chromosomes 3 and 5 are shown in Fig. 2, and locus zoom plots for SNPs in chromosomes 2, 6, 7, 12, and 13 are shown in Additional File 1: Fig. S3. Note that the SNPs with suggestive significance in chromosome 2 were more than 85 kb from the closest gene, LINC01854, and SNPs in chromosome 7 were more than 116 kb from the closest gene, LINC01446. In addition, we observed suggestive significance for SNPs in the genes CLIC5 (chromosome 6) and GALNT9 (chromosome 12), and for SNPs whose closest gene was FAM155A (chromosome 13) (P-value < 1.0E−5) in all five models (Additional File 1: Table S1).
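The multiple-testing threshold quoted above follows directly from dividing the family-wise error rate by the effective number of independent SNPs. The short check below reproduces that arithmetic and compares the two reported P-values for rs6440776 against it; the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_effective_tests):
    """Per-test significance level for a given family-wise alpha."""
    return alpha / n_effective_tests


threshold = bonferroni_threshold(0.05, 987_177)
print(f"threshold = {threshold:.3g}")

# P-values reported in the text for rs6440776.
p_with_hypertension = 3.71e-08   # hypertension added to the base model
p_with_dyslipidemia = 5.59e-08   # hypertension, diabetes, and dyslipidemia added

print(p_with_hypertension < threshold)  # passes the threshold
print(p_with_dyslipidemia < threshold)  # narrowly misses the threshold
```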

Fig. 1

Manhattan plots. a The base model adjusted for sex, age, 10 PCs, and SNP as in model 0. b Hypertension is added to the base model 0. c Diabetes is added to the model 1. d Dyslipidemia is added to model 2. e Cardiac status is added to model 3. f Waist-to-hip ratio is added to model 4

Table 2 Novel SNPs association with ischemic stroke*
Table 3 Meta-analysis of SIREN and COMPASS studies for fixed effects (FE), conventional random effects (RE), alternate random effects (RE2), and binary effects (BE) models from METASOFT
Fig. 2

Locus zoom plots for SNPs rs6440776 (hg19: chr3:151396081 and hg38: chr3:151678293) and rs77326269 (hg19: chr5:8499398 and hg38: chr5:8499286) based on P-values using the base model

Transferability analysis

Due to the lack of a replication sample of indigenous Africans, we investigated the transferability of our findings in COMPASS (African-American meta-analysis) and MEGASTROKE (European ancestry meta-analysis). Additional File 1: Table S3a shows the statistical significance in COMPASS (column BB provides the P-values in COMPASS) and MEGASTROKE (column BI provides the P-values in MEGASTROKE) for top SIREN hits with P-value < 1.0E−06, corresponding to Additional File 1: Table S1. Additional File 1: Table S3b presents significance levels in COMPASS and MEGASTROKE for SNPs with SIREN P-value < 1.0E−04, corresponding to Additional File 1: Table S2. Note that only two SNPs with P-value < 1.0E−04 in SIREN, rs116683655 and rs76250200 within the ISPD gene, were marginally significant in COMPASS, with P-values of 3.98E−03 and 5.57E−03, respectively. Among SNPs with SIREN P-values < 1.0E−04, the lowest P-value in MEGASTROKE was 8.38E−03, for the SNP rs7239115 in chromosome 18 within the gene region LINC01898-LOC339298. Conversely, we also investigated the transferability to SIREN of variants previously associated with stroke in COMPASS and MEGASTROKE (Additional File 1: Table S4a and S4b). In Additional File 1: Table S4a, S4b, and S13a-d, we list the specific SNPs associated with stroke risk among African-Americans and Europeans as identified in the COMPASS and MEGASTROKE studies, respectively. SNPs labeled as multi-ancestry were also identified. In the SIREN cohort, we observed nominal associations for multiple COMPASS SNPs, including rs116262092 (P-value = 0.02) and rs147867382 (P-value = 0.02) in the RUNX1 gene in chromosome 21, rs184221467 (P-value = 0.02) near the AK092619 gene in chromosome 3, and rs115670077 (P-value = 0.01) between the RFTN2 and MARS2 genes, with a direction of effect similar to that in COMPASS. 
Additional analysis comparing the effect sizes of the variants across the COMPASS and SIREN cohorts demonstrated similar effect sizes and direction of effect in most of the loci.

We further investigated the transferability to SIREN of variants previously associated with stroke subtypes in COMPASS and MEGASTROKE. We tested the top significant SNP associations from COMPASS and MEGASTROKE in SIREN for large artery disease (cases = 509 vs. controls = 1738), small vessel occlusion (cases = 590 vs. controls = 1738), and undetermined etiology (cases = 451 vs. controls = 1738). None of the top loci in COMPASS or MEGASTROKE were significant after Bonferroni correction for any of the subtypes. The results for subtypes corresponding to COMPASS and MEGASTROKE are provided in Additional File 1: Table S5a and S5b, respectively, and showed marginally significant results in subtypes with P-value < 0.05 in SIREN. The effect sizes are in the same directions as in COMPASS and MEGASTROKE except for SNP rs113025543 (FAR), which is a protective factor in COMPASS but a risk factor in SIREN for small-vessel disease, and SNP rs11867415 (PRPF8), which is a risk factor in MEGASTROKE but protective in SIREN for small-vessel disease (Additional File 1: Table S5c contains the summary of the marginally significant results of the subtypes in SIREN).

African ancestry meta-analysis

Additional File 1: Table S6 contains the results from METASOFT for P-values < 1.0E−04 corresponding to the RE2 model. There were 14,053,108 SNPs common to both SIREN and COMPASS. Table 3 provides a summary of the METASOFT results with P-values less than 1.0E−06 for Han and Eskin’s random effects model (RE2) and the binary effects model (BE), together with the heterogeneity value I2 and the corresponding SIREN and COMPASS P-values and effect size directions. There were 15 SNPs in Han and Eskin’s random effects model (RE2) and 13 SNPs in the binary effects model (BE) with P-value < 1.0E−06. COMPASS SNPs drove most of the SNP significance in the RE2 model. However, SIREN SNPs were significant in the BE model, with I2 greater than or equal to 0.90 and P-value < 1.0E−06. Note that rs6440776 in the intergenic region of MIR5186-AADACL2 in chromosome 3 and rs2194650 in POM121L12-LINC01446 were also significant with a P-value less than 1.0E−06 in the BE model, corresponding to SIREN P-value < 1.0E−06. Moreover, the direction of effect of the associations with ischemic stroke was the same in SIREN and COMPASS for 2504 of the 3111 SNPs in Additional File 1: Table S6 (see the SIREN vs. COMPASS effect size plot in Additional File 4: Fig. S4).

Transethnic meta-analysis

Transethnic meta-analysis was performed in MANTRA using the SIREN, COMPASS, and MEGASTROKE studies. There were 6,092,926 SNPs common to all three studies. The MANTRA results with log10 (Bayes factor) > 4 are included in Additional File 1: Table S7. A summary of the MANTRA results with log10 (Bayes factor) ≥ 10.0 is given in Table 4. The significance of all SNPs in Table 4 was mainly driven by the MEGASTROKE P-values and effect sizes. Note that MEGASTROKE was the largest of the three studies, with a sample size of 446,696, compared with 22,051 individuals in COMPASS and 3434 in SIREN. It is not uncommon for a meta-analysis to be heavily dominated by a single largest study [35, 36]. We observed that allele frequency distributions in MEGASTROKE differed from those in COMPASS and SIREN, whereas the COMPASS and SIREN allele frequency distributions were similar to each other (see Additional File 4: Fig. S5). There were 231 SNPs with log10 (Bayes factor) ≥ 6.0, and most of these SNPs were significant in MEGASTROKE. SIREN study-driven MANTRA results are given in Table 5 with Bayes factor of at least 4.0, posterior probability of 1, and SIREN P-value < 1.0E−04. Both SNPs rs6440776 and rs2410883 in MIR5186-AADACL2 in chromosome 3 had Bayes factor greater than 5 with effects in the same direction in all three studies. The SNPs in chromosomes 7, 18, and 20 had Bayes factor greater than 4.0 with a posterior probability of 1, corresponding to SIREN P-values < 1.0E−04.

Table 4 Results from MANTRA using SIREN, COMPASS, and MEGASTROKE studies with Log10 (Bayes factor)≥10
Table 5 SIREN study driven MANTRA results with Log10 (Bayes factor)≥4.0*

Fine-mapping

Before performing fine-mapping, localized zoom plots in Fig. 2 were consulted for both the regional association landscape and linkage disequilibrium with the lead variant in the region of interest. Fine-mapping regions were initially identified using a genomic base-pair window size of 500 kb on both the 5′ and 3′ ends of the significant hits near the AADACL2 and MIR4458HG genes based on the hg19 coordinate system. Fine-mapping in chromosome 3 indicated that 2 of the 627 variants considered were potentially causal (rs7611359, position: 151266619, posterior probability = 1.0 with 99% credible interval; and rs9815407, position: 151269245, posterior probability = 1.0 with 99% credible interval) (Fig. 3a). Similarly, fine-mapping in chromosome 5 indicated that 4 of the 568 variants considered were potentially causal (rs341875, position: 8512751, posterior probability = 0.17 with 99% credible interval; rs77326269, position: 8499398, posterior probability = 0.14 with 99% credible interval; rs73740017, position: 8499591, posterior probability = 0.14 with 99% credible interval; and rs57085808, position: 8496279, posterior probability = 0.13 with 99% credible interval) (Fig. 3b). To select the top 10 annotation sets for each region, we employed the suggested pipeline outlined in the PAINTOR software GitHub repository. Additional File 1: Table S8a and S9a capture the marginal significance estimates for each annotation and the overall likelihood ratio test (LRT) estimates, which were used to select the top 10 annotations of interest.

Fig. 3

a Fine-mapping of AADACL2 gene region. b MIR4458HG gene region. Panel 1 depicts a scatterplot of location versus posterior probabilities with a 99% credible interval; panel 2 provides functional annotation tracks

Gene sets enrichment analysis

To determine the tissue/cell type specificity of the gene expression profile for our genes of interest, we used the gene lookup mechanism in GTExPortal V8 (https://www.gtexportal.org). The gene expression analysis in GTExPortal V8 for the MIR4458HG and AADACL2 genes is shown in Additional File 4: Fig. S6a and S6b. The highest expression of the MIR4458HG gene was observed in brain-cerebellar hemisphere and brain-cerebellum. To further understand the functional implications of the significant single-variant association results, we used the SNP2GENE module of functional mapping and annotation (FUMA) GWAS. We included SNPs from any model with a P-value < 1.0E−5 in the SNP2GENE analysis. The MAGMA tissue-specific expression analysis results of the SNP2GENE module are given in Additional File 1: Table S10a. Tissue-specific expression with P-value < 0.05 was observed in thyroid, brain cerebellar hemisphere, and brain cerebellum tissues. In addition, we performed GENE2FUNC using a compilation of 191 genes that were aggregated from the ANNOVAR gene assignment report for SNPs in Additional File 1: Table S2 and genes that showed chromatin and eQTL interactions based on SNP2GENE results. The 143 genes with a recognized unique Ensembl ID were used in annotation and mapping. In the tissue-specific analysis, FUMA GENE2FUNC differentially expressed genes were either upregulated or downregulated. In the brain, enrichment for upregulated differentially expressed genes was observed in brain spinal cord cervical C-1 (padj = 0.043), and enrichment for downregulated genes was observed in brain frontal cortex BA9 (padj = 0.032) along with brain cortex (padj = 0.090). Details of the upregulated and downregulated genes are provided in Additional File 4: Fig. S6a and S6b and Additional File 1: Table S10b. We also observed two-sided significant regulation of genes in specific tissues, namely brain frontal cortex BA9 (padj = 0.023).

Genomic landscaping for genes AADACL2 and MIR4458HG

Given the dense distribution of variants in and around the lead significant SNP, localized genomic visualization models (Figs. 4 and 5) were rendered to investigate (1) the presence of methylation hotspots in the form of CpG islands/shores, (2) enhancer and promoter activity reported by GeneHancer, and (3) the interaction between GeneHancer regulatory elements and neighboring genes. Furthermore, the brain DNA methylation profile was investigated in and around the region of the significant SNPs and is shown as independent tracks in the rendered regions: (a) the genome-wide methylation (MeDIP-seq and MRE-seq) landscape, (b) histone H3 lysine 4 trimethylation (H3K4me3), and (c) gene expression (RNA-seq and RNA-seq (SMART)) profiles. Figure 4a illustrates the chromatin interaction link between significant regions proximate to the AADACL2 gene and the nearby IGSF10 gene using the SNP2GENE function in FUMA. Additional File 1: Table S11 lists the significant intra-chromosomal chromatin interactions and the strength of SNP-gene-tissue eQTL mapping for genome-wide significant SNP regions along with novel SNPs near AADACL2. Based on the GWAS significance statistics for SNPs in that region, P2RY13 and P2RY14 are potential eQTLs with significant mapping interaction with the novel SNPs near AADACL2. The UCSC Genome Browser on the human Hg38 build was used to render the omics landscape around the significant SNP regions. The AADACL2 omics landscape in Fig. 4b shows minimal promoter and enhancer presence. Interestingly, 5 clustered interactions between gene enhancer regulatory elements and the AADAC gene, which is located only 56 kb downstream of AADACL2, were observed around the region of the AADACL2 gene. Figure 5a depicts the chromatin interaction of rs57085808 with nearby genes. As depicted in Fig. 5b, the methylation hotspot (CpG island) at the 5′ end of MIR4458HG showed a high level of H3K27Ac epigenetic modification signal.
The H3K27Ac histone mark is known to be a strong marker of active promoter and enhancer activity that is strongly associated with the transcription factor binding mechanism and gene expression profile. The histone mark's activity is further supported by the presence of a cluster of strong active promoter regions (red bands) along with transcriptional transition and elongation (green bands) hotspots, thereby suggesting a potential interaction between DNA methylation and histone modifications around the region of the MIR4458HG gene. Additional File 1: Table S12 contains the significant intra-SNP-gene-tissue eQTL mapping for the MIR4458HG gene.

Fig. 4

a Circos plot showcasing chromatin interaction (orange arcs) and eQTL interactions (green arcs) originating from SNP rs6440776 (AADAC gene region). b Genomic landscape for AADACL2 illustrating CpG islands, enhancer/promoter presence, histone modification sites, and regulatory interaction activity from UCSC browser

Fig. 5

a Circos plot showcasing chromatin interaction (orange arcs) and eQTL interactions (green arcs) originating from SNP rs57085808 (MIR4458HG gene region). b Genomic landscape for MIR4458HG illustrating CpG islands, enhancer/promoter presence, histone modification sites, and regulatory interaction activity from UCSC browser

Genome geography discrepancies

Although sequence similarity between multiple gene transcripts for a given genomic region is well validated and verified, genomic annotations are yet to reach robust levels of certainty and stability across evolving versions of the human reference genome. We employed the TOPMed imputation reference panel with human genome Hg38, whereas replication cohorts such as COMPASS and MEGASTROKE reported their variants based on Hg19 (GRCh37). To accommodate potential inconsistencies between these two versions of the human genome, we presented variant annotations in our additional file datasets for both Hg19- and Hg38-based coordinate systems. The genome versioning challenge also helped us uncover a few issues with annotating SNPs when assigning HUGO-approved gene names, genomic functions, and SNP annotations. One of our top-hit variants, rs6440776, was reported at chr3:151678293 based on the Hg38 genome assembly and at chr3:151396081 based on the Hg19 assembly. Depending on the version of the assembly used, SNP rs6440776 was mapped to the intergenic region between the genes MIR5186 and AADACL2 (Hg38) or to the ncRNA intronic region of the gene MIR548H2 (Hg19). Also, depending on the version of the dbSNP data repository used to drive the SNP annotations, the same variant on Chr2 at position 129359443 (Hg38), with mapped position 130117016 (Hg19), was assigned the registered dbSNP names rs111452560 and rs116332314 in different data releases of the dbSNP database.
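
The coordinate discrepancy described above can be illustrated with the two variants for which both positions are given. The minimal Python sketch below computes the shift between builds for each variant; the `Variant` record and its field names are hypothetical and serve only to organise the numbers quoted in the text.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """A variant with its position in both genome builds (illustrative record)."""
    label: str
    chrom: str
    pos_hg19: int
    pos_hg38: int


# Positions as reported in the text.
variants = [
    Variant("rs6440776", "chr3", 151_396_081, 151_678_293),
    Variant("rs111452560/rs116332314", "chr2", 130_117_016, 129_359_443),
]


def build_offset(v: Variant) -> int:
    """Signed shift in base pairs when moving from hg19 to hg38."""
    return v.pos_hg38 - v.pos_hg19


offsets = {v.label: build_offset(v) for v in variants}
for label, shift in offsets.items():
    print(f"{label}: hg38 - hg19 = {shift:+,d} bp")
```

The two shifts differ in both size and sign (about +282 kb on chromosome 3 and about −758 kb on chromosome 2), so no single constant converts one build to the other. This suggests that cross-study comparisons are safer when variants are matched on build-specific chromosome, position, and alleles after a proper coordinate conversion, rather than on rsID or position alone.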

Discussion

In this first genome-wide association study of ischemic stroke among indigenous Africans, we observed genome-wide significant SNP associations (rs6440776 and rs2410883) near the AADACL2 gene in chromosome 3, after adjusting for hypertension, diabetes, and dyslipidemia as covariates in the base model. Five SNPs (rs57085808, rs57033994, rs143745837, rs77326269, and rs73740017) near the miRNA (MIR4458HG) gene in chromosome 5 were also associated with ischemic stroke with suggestive significance (P-value < 1.0E−6). The loci near the AADACL2 and MIR4458HG genes are novel and protective. The region near the AADACL2 gene remained marginally significant following African ancestry meta-analysis and fine-mapping. The functional and clinical relevance of the identified risk loci is further supported by eQTL and chromatin interaction data. The observed protectiveness of these loci against stroke has promising implications for ancestry-specific risk stratification and the search for drug targets that can enhance the primary or secondary prevention of stroke (please see Additional File 3: Additional Discussion (additional discussion point a and additional discussion point b) on other marginally significant genetic variants).

The arylacetamide deacetylase like 2 (AADACL2) gene is a protein coding gene that is strongly expressed in the skin, an organ that shares embryological origins with the nervous system. The gene is implicated in epidermal barrier function [37] and has demonstrated previous associations with multiple phenotypes including idiopathic dilated cardiomyopathy [38]. Loci near AADACL2 in the present study demonstrate protection against ischemic stroke with top SNPs: rs6440776 with OR 0.74 (0.66–0.82) and P-value = 3.71E−08 and rs2410883 with OR 0.74 (0.66–0.82) and P-value = 4.38E−08 when hypertension was included in the model.
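
As a consistency check on the reported effect sizes, an approximate Wald statistic can be recovered from an odds ratio and its confidence interval. The minimal Python sketch below assumes that the interval is a 95% Wald interval on the log-odds scale (an assumption, since the interval type is not stated here) and uses only the rounded values quoted above, so the result can only approximate the P-value from the original regression model. The function name is illustrative.

```python
import math


def wald_from_or_ci(odds_ratio: float, lower: float, upper: float,
                    z_crit: float = 1.959964) -> tuple[float, float, float, float]:
    """Approximate (beta, SE, z, two-sided P) from an OR and its 95% Wald CI."""
    beta = math.log(odds_ratio)                                # log odds ratio
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)    # CI width -> SE
    z = beta / se                                              # Wald statistic
    p = math.erfc(abs(z) / math.sqrt(2))                       # two-sided normal P
    return beta, se, z, p


# Rounded values reported for rs6440776 and rs2410883: OR 0.74 (0.66-0.82).
beta, se, z, p = wald_from_or_ci(0.74, 0.66, 0.82)
print(f"beta = {beta:.3f}, SE = {se:.4f}, z = {z:.2f}, P = {p:.2e}")
```

With these rounded inputs, the recovered P-value is of the same order of magnitude as the reported values (3.71E−08 and 4.38E−08); exact agreement is not expected because the odds ratio and interval limits are rounded to two decimal places.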

Fine-mapping of the significant genomic regions near the AADACL2 gene in chromosome 3 yielded two potentially causal variants, rs7611359 and rs9815407, each with a posterior probability of 1.0. Gene expression profiling for the AADACL2 gene using GTEx v8 showed maximum expression in the skin, while genomic landscaping showed minimal enhancer, histone modification, and regulatory interaction activity. In addition, 5 clustered interactions between gene enhancer regulatory elements and the AADAC gene, located 56 kb downstream of AADACL2, were observed around the region of the AADACL2 gene. The significant histone modification and regulatory activity of the novel loci near the AADACL2 gene plausibly explain the protection against ischemic stroke demonstrated in this study. Potential interactions between the novel loci discovered near AADACL2 in this study and other genes, particularly those in proximity on chromosome 3, may also explain the protective function of the novel loci in relation to ischemic stroke. Chromatin interaction mapping of regions proximate to the AADACL2 gene demonstrated significant intra-chromosomal chromatin interaction with the IGSF10 (immunoglobulin superfamily, member 10) gene, which has relevant immune regulatory functions [39].

The MIR4458HG gene is an intergenic non-coding miRNA gene that is expressed in multiple tissues, including the brain and arteries [40, 41], and has been associated with metabolite level and heart rate in heart failure with reduced ejection fraction [42]. The MIR4458HG gene was previously associated with coronary artery calcification in a GWAS among African-American/Afro-Caribbean subjects with type 2 diabetes [43]. In this study, SNPs near the MIR4458HG gene locus demonstrated protection against ischemic stroke with ORs < 1 at suggestive significance levels.

Fine-mapping of the significant genomic regions near the MIR4458HG gene in chromosome 5 yielded 4 potentially causal variants, the top of which was rs341875 with a posterior probability of 0.17. Gene expression analysis was undertaken for the MIR4458HG gene in GTExPortal V8 in both general and specific tissues. This demonstrated the highest expression in the brain cerebellar hemisphere, cerebellum, and thyroid as well as the tibial and coronary arteries. Functional mapping and annotation (FUMA) expression analysis in MAGMA demonstrated differential gene expression in the brain spinal cord cervical C1 and brain frontal cortex BA9. Genomic landscaping for MIR4458HG yielded methylation signals, strong enhancer/promoter activity, histone modification sites, and regulatory interaction activity with a high level of H3K27Ac epigenetic modification signal. These findings demonstrate epigenetic interactions, including DNA methylation and histone modifications, around the MIR4458HG gene. They thus suggest regulatory activity of the variants near the MIR4458HG gene as a plausible mechanism for the protective effect on ischemic stroke, and they point to the potential of the region to contain targets for drug development for primary or secondary prevention of stroke [11].

A recent cell culture study demonstrated that miR-4458 negatively modulated cardiac hypertrophy, a known intermediate phenotype and an independent risk factor for ischemic stroke, by activating mitochondrial transcription factor A (TFAM), a well-recognized myocardial protective protein. Indeed, miR-4458 facilitated TFAM expression in cardiomyocytes to inhibit cardiac hypertrophy [44]. Several other microRNAs, such as miR-375 [45], miR-195 [46], miR-221 [47], and miR-338 [48], have also demonstrated protection against ischemic stroke via multiple mechanisms. Moreover, microRNAs constitute an emerging category of biomolecules with the promise of enhancing risk prediction, diagnosis, prognosis, and treatment of ischemic stroke and its subtypes [49,50,51].

Clinical implications of functional expressions and interaction analysis

Expression quantitative trait loci (eQTL) mapping and chromatin interaction analysis in FUMA demonstrate interaction of variants of either genome-wide or suggestive significance with multiple other variants that show significant expression in vascular or brain tissue and association with cerebrovascular disease phenotypes, other brain disorders, or vascular diseases (Additional File 1: Tables S10 and S11). For instance, novel loci near the AADACL2 gene yielded potential eQTLs including the AADAC, MBNL1, TMEM14E, P2RY13, and P2RY14 genes, with P2RY13 and P2RY14 demonstrating significant mapping interaction. The purinergic receptor (P2Y13) plays a major role in HDL metabolism by facilitating reverse cholesterol transport and promoting the inhibition of atherosclerosis progression [52,53,54]. Thus, it appears that the protectiveness of the novel locus near AADACL2 against stroke may be associated with its epistatic interaction with the P2RY13 gene. Systems genetics analysis has also defined the importance of transmembrane protein 43 (TMEM43) in cardiac- and metabolic-related pathways, suggesting that cardiovascular disease-relevant risk factors may also increase the risk of metabolic and neurodegenerative diseases via TMEM43-mediated pathways [55]. Broad cellular functions and diseases, including arrhythmogenic right ventricular cardiomyopathy (ARVC5), have been associated with TMEM43.

Taken together, the findings in this study demonstrate emerging differential roles for regulatory miRNA, intergenic non-coding DNA, and intronic non-coding RNA in the pathobiology of ischemic stroke. The protectiveness of some genetic loci related to miRNAs, which are largely regulatory, suggests the possible occurrence of downstream biomolecules and processes in dysregulated pathways and networks, which require further exploration and characterization. Indeed, multiple loci which demonstrate significant interaction with our key discovery variants (with regulatory function) through FUMA have shown expression in brain, vascular, cardiac, and neuronal tissue apart from direct association with different subtypes of cerebrovascular disorders. These have implications for novel fluid biomarkers for stroke, drug development, and repurposing, multi-omics analysis including genome-wide miRNA analyses, and generation of polygenic risk score (PRS) that will likely be more accurate for African populations [56,57,58].

Comparison with existing stroke GWAS

Replication is a critical part of genome-wide association studies, while the concept of transferability applies when the replication cohort is drawn from a population different from the discovery sample [59, 60] (please see additional discussion point c in Additional File 3: Additional Discussion). Findings from the SIREN discovery analysis demonstrated poor transferability in the COMPASS meta-analyses among African-Americans [8, 9] and vice versa, possibly because of genetic admixture in the African-Americans. However, the similarity in direction of effect between the associations of the loci with ischemic stroke in both the SIREN and COMPASS studies strengthens the biological validity of the association of these loci with ischemic stroke (Additional File 4: Fig. S4) [13]. Similarly, the findings from the MEGASTROKE meta-analysis [10] showed non-transferability in both the SIREN and COMPASS GWAS analyses. The MEGASTROKE GWAS was conducted in a predominantly European ancestry population with only 4.0% African ancestry (African-Americans), which is slightly more than the 3.7% African ancestry in GIGASTROKE [19]. Differences in the ancestral backgrounds of the SIREN and MEGASTROKE cohorts and the dominance of the small vessel disease stroke subtype among blacks compared to Caucasians are plausible reasons for this non-transferability. A recent high-depth study of African genomes identified more than 3 million previously undescribed genetic variants [18]. This observation underscores the uniqueness of the genetic architecture of indigenous African populations, with variants that may not be present in other populations. This has implications for the non-transferability observed in this study and in other African studies (DM, glaucoma, and lipid traits) [13, 61, 62] (please see additional discussion points d and e in Additional File 3: Additional Discussion).
The existence of such ancestry-specific variants has implications for the development of polygenic risk scores (PRS) of higher accuracy in the stratification of individuals based on disease risks. This therefore strengthens the argument for ancestry or region-specific PRS.

Strengths, limitations, and future direction

Our study has a major strength in being the first stroke GWAS in an indigenous African population with novel functional and clinical implications. The key limitations are the absence of a suitable independent replication cohort of indigenous African ancestry and the non-availability of databases enriched with African ancestry information for in silico functional analysis. These could have limited the full understanding of the functional implications of our discoveries. This limitation is particularly common to pioneering GWAS of African ancestry individuals, such as the recent GWAS of rheumatic heart disease [63]. The current study was also not sufficiently powered for stroke subtype-specific analysis to identify ischemic stroke subtype-specific risk loci. We found marginally significant transferability upon investigation of variants associated with ischemic stroke subtypes due to small vessel disease and large artery atherosclerosis. Larger stroke GWAS are required in the future to accurately dissect the genetic and pathological heterogeneity between ischemic stroke subtypes among indigenous Africans. We investigated the functional relevance of the identified risk loci using bioinformatic analyses, which we plan to confirm via in vitro and in vivo studies in the near future.

Conclusions

In this first-ever GWAS of stroke in indigenous Africans, novel genomic regions near genes AADACL2 and MIR4458HG exhibited significant protective associations with ischemic stroke with significant eQTL mapping and chromatin interactions with multiple loci associated with vascular disorders. Our findings identify potential roles of regulatory miRNA, intergenic non-coding DNA, and intronic non-coding RNA in the pathobiology of ischemic stroke among indigenous Africans.