Background

A central goal of genetics is to identify the genetic underpinnings of human diseases. Advancements in human genetics and its related fields and technologies over the past decades have had a remarkable impact on our understanding of human disease pathophysiology, diagnosis and management [1]. In Mendelian disorders and rare genetic diseases this often takes the form of a loss-of-function mutation or genomic abnormality driving the disease phenotype. More than 5,000 diseases in this category are recorded in the Online Mendelian Inheritance in Man (OMIM) database [2]. For complex diseases, multiple genetic and environmental factors contribute to disease risk, and the identification of genetic risk factors associated with complex diseases has been rapidly accelerating with the use of next-generation sequencing and dense array genotyping technologies in genome-wide association studies (GWAS). In a GWAS, thousands of genetic variants are genotyped in individuals, and the genotypes are then used to identify statistical associations between variants at certain genomic loci and a particular phenotype [3]. Since the first reported GWAS association for age-related macular degeneration [4], the use of these studies has grown exponentially, with over 200,000 genetic variants associated with more than 3,000 human traits reported [5]. The remarkable growth of GWAS has created a critical need to experimentally identify and validate the disease-associated variants [6, 7]. This unmet need has hindered the translation of GWAS findings into disease biology mechanisms and hence therapies. There are seemingly very few examples of GWAS-identified genetic loci at which the causal variant and molecular mechanisms driving the association have been experimentally determined, especially considering the sheer number of genotype–phenotype associations reported to pass the genome-wide significance threshold.

Dissecting GWAS loci to uncover the underlying biology is a complicated multi-step process. High linkage disequilibrium (LD) between many variants often necessitates statistical fine-mapping approaches and overlap with functional genomic annotations to prioritize variants before experimental validation [3, 8]. For coding variants, the target gene is identified directly from the genomic location of the variant [9]. As protein-coding regions represent only a small percentage of the human genome, more than 90% of GWAS-associated variants are annotated to be within non-coding parts of the genome [5]. Experimental identification and validation of non-coding variants involves an additional level of complexity compared with coding variants and requires additional approaches [10, 11]. Moreover, the functionality of regulatory elements is often cell-type specific, which necessitates studying the mechanism in disease-relevant cell types [12].

Experimental identification and validation are critical elements in translating GWAS findings. To date there has been limited study of the number of GWAS-identified loci that have been experimentally validated. A systematic literature review of 36,676 published articles identified 309 experimentally validated non-coding GWAS variants, regulating 252 genes across 130 human disease traits. This review of the literature is the first to systematically evaluate the status and the landscape of experimentation being used to validate non-coding GWAS-identified variants. We additionally curated key information from all included studies such as validated variant class, distance-to-target gene, and experimental validation methods. Our findings have value for future experimental validation studies, target gene prioritization and functional variant prediction. The approaches utilized to validate coding variants as well as current methods used to nominate candidate functional variants for functional studies are outside the scope of this manuscript and have been reviewed previously [8, 9].

Methods

We conducted a systematic literature search and report it in compliance with the standards set forth by the 2020 PRISMA statement on the reporting of systematic reviews [13]. As a traditional keyword-based search approach would not enable us to thoroughly search for all relevant concepts and combinations, we leveraged natural language processing (NLP) and ontology-based text mining to ensure a systematic identification of relevant validation articles [14, 15]. We defined the scope to include studies that perform validation of GWAS associated non-coding variants at least at a molecular level.

In order to build a comprehensive literature search strategy, we first identified 28 validation studies from recent reviews and published resources [6, 7, 16]. These index studies were evaluated to identify the optimal keywords and concepts that would be used in the systematic literature search. Figure 1 shows a flow diagram summarizing the systematic literature search approach that was employed. The systematic literature search was conducted using search and filter concepts identified by thorough manual and text mining-supported concept analysis of index articles. The initial broad search was based on four different sub-queries aimed at identifying any articles that might include experimental validation of GWAS variants. We included explicit mention of GWAS, non-coding, functional or causal variant as well as contextual mentions of non-coding concepts such as enhancers and promoters (Additional file 1). Queries were run on MEDLINE Full Index [17] (all MEDLINE content until February 19, 2021) using IQVIA/Linguamatics I2E KNIME nodes [18]. Concepts and various combinations were searched in title, abstract and meta-data (author keywords, Medical Subject Headings (MeSH) terms and substances), leveraging public standard life science ontologies (such as MeSH [19], NCI Thesaurus [20] or Entrez Gene [21]), custom vocabularies, syntactical rules, grammatical patterns and linguistic entity classes. This allowed us to build queries that were both more generalized (comprehensive) and more precise than those of standard keyword search engines. The PMIDs identified by each query were combined and filtered for publication year ≥ 2007 (using “PubMed Publication Data (entrez)”). After removing duplicates, we arrived at 36,676 unique articles (Fig. 1A).

We built seven filters reflecting our key inclusion criteria to narrow down the search results: (1) filter for primary research articles and exclude other article types, (2) GWAS and/or association filter, (3) filter for any human disease, (4) filter for any human gene (RefSeq), (5) filter for explicit mention of “non-coding” or non-coding context (enhancers, intron, non-coding, microRNA, etc.), (6) filter for functional, causal, or regulatory variant or specific rsID, and (7) wet-lab experimental validation techniques (Fig. 1B, Additional file 2). Filters were built using an in-house entity extraction and literature classification pipeline combining SciBite’s TERMite (TERM identification, tagging & extraction) API coupled with SciBite’s VOCabs [22] and IQVIA/Linguamatics I2E Software.
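The sequential filtering step can be illustrated with a minimal Python sketch. It is not the TERMite/I2E pipeline used in this study: the `Article` class, the tag names and the assumption that an upstream text-mining step has already attached concept tags to each article are all hypothetical and serve only to show how seven inclusion filters applied in order narrow a result set while recording how many articles each filter removes.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Article:
    pmid: str
    tags: Set[str]  # concept tags from an upstream text-mining step (hypothetical)


# Hypothetical tag names standing in for the seven inclusion filters.
FILTERS = [
    "primary_research",
    "gwas_or_association",
    "human_disease",
    "human_gene",
    "noncoding_context",
    "functional_variant_or_rsid",
    "wetlab_validation",
]


def apply_filters(articles: List[Article]) -> Tuple[List[Article], Dict[str, int]]:
    """Apply the filters in order; return survivors and per-filter exclusion counts."""
    remaining = list(articles)
    excluded: Dict[str, int] = {}
    for name in FILTERS:
        kept = [a for a in remaining if name in a.tags]
        excluded[name] = len(remaining) - len(kept)
        remaining = kept
    return remaining, excluded
```

An article is retained only if it carries every tag, which mirrors the requirement that articles pass all filter criteria before manual review.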

Fig. 1

Systematic literature search and validation approach. Flow diagram demonstrating the systematic literature search strategy starting with A broad MEDLINE search including all potentially related articles. The search included several concepts related to GWAS, non-coding contexts and other related terms detailed in Additional file 1. B Using text-mining of article titles, abstracts and metadata, we built seven filters to narrow down the search results which excluded 35,222 articles. Exact search terms and their combinations used in the filters are provided in Additional file 2. C 1454 articles of interest that passed all the filters were manually screened and evaluated for eligibility. D Through manual curation an additional set of 579 articles was excluded. E 875 eligible articles that passed manual curation were annotated to identify key information from each study. F These articles proceeded to cross-referencing against the GWAS Catalog to ensure that the validated variants and their reported associated disease trait match known GWAS associations. G Cross-referencing excluded 589 articles with poor GWAS trait matches or no variant match. H The final systematic review includes 286 articles. Reasons for exclusion at each stage are shown in red on the right side and described in more detail in the main text

In total 1454 articles passed all filter criteria and were then manually reviewed by three curators (Fig. 1C). All articles had to meet the following criteria to be considered for inclusion: (1) investigate variants associated with a human disease, (2) include experimental wet-lab molecular validation of one or more variants, (3) include putative validation of at least one non-coding variant, and (4) investigate single nucleotide polymorphisms (SNPs), excluding indels, purely coding, somatic, or rare variants. Abstracts and full texts were reviewed resulting in the exclusion of 579 articles (Fig. 1D). Overall, this manual review identified 875 potentially relevant articles. All these articles were manually curated to confirm the rsID of the reportedly validated variants, variant class, the reported regulated gene, and the associated disease (Fig. 1E).

We then used the information on the validated variant’s rsID and disease trait to cross-validate our data with the GWAS Catalog [5] (accessed Mar 25, 2021) to confirm that each curated variant-disease association is reported in a GWAS (Fig. 1F). Corresponding associations were identified through LD between the curated SNP and the reported GWAS Catalog SNP, and similarity between the reported GWAS trait and the traits extracted from the PubMed abstract as detailed below. Because the GWAS Catalog only reports the lead variant for each locus, and this variant is not necessarily identical to the causal variant for the association, we performed an LD expansion from each top SNP to identify additional possible causal variants. Broad ancestry as reported in the GWAS Catalog was mapped to a 1000 Genomes superpopulation following methods we described recently [23]. For each associated SNP in the GWAS Catalog, an LD expansion was performed to identify SNPs within 1 Mb with LD r² ≥ 0.5 in the corresponding 1000 Genomes superpopulation. A minor allele count threshold of 5 within the corresponding superpopulation was applied to reduce the impact of high-variance LD estimates for rare variants. If it was not possible to map to a single superpopulation, LD expansion was performed using the full 1000 Genomes Phase 3 GRCh38 liftover to match the build used in the GWAS Catalog [24]. When the GWAS Catalog reported a specific risk allele, our LD expansion took this into account, such that for multiallelic SNPs we would only identify variants correlated with the reported allele. The choice of LD threshold is motivated by the goal of capturing GWAS associations that could plausibly be explained by the cataloged variant and has been used elsewhere [25]. Using this methodology, it was possible to perform LD expansion for 91% of variants in the GWAS Catalog. GWAS Catalog variants for which an LD expansion was not possible were still included in the analysis but could only be matched to the reported variant rather than other possible causal variants.
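As an illustration of the LD expansion logic, the following Python sketch applies the three thresholds stated above (1 Mb window, r² ≥ 0.5, minor allele count of at least 5) to a toy genotype matrix. It is a simplified sketch rather than the pipeline used here: the function name and inputs are hypothetical, r² is approximated as the squared Pearson correlation of genotype dosages instead of being computed from phased haplotypes, the minor allele count filter is assumed to apply to candidate variants, and allele-specific handling of multiallelic SNPs is omitted.

```python
import numpy as np


def ld_expand(genotypes, positions, lead_idx,
              window_bp=1_000_000, r2_min=0.5, min_mac=5):
    """Return column indices of variants in LD with the lead variant.

    genotypes: samples x variants array of alternate-allele dosages (0, 1, 2).
    positions: base-pair position of each variant on one chromosome.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    positions = np.asarray(positions)
    n_samples = genotypes.shape[0]

    # Minor allele count per variant (diploid samples assumed).
    alt_count = genotypes.sum(axis=0)
    mac = np.minimum(alt_count, 2 * n_samples - alt_count)

    lead = genotypes[:, lead_idx]
    hits = []
    for j in range(genotypes.shape[1]):
        if j == lead_idx:
            continue
        if abs(positions[j] - positions[lead_idx]) > window_bp:
            continue  # outside the 1 Mb window
        if mac[j] < min_mac:
            continue  # LD estimate too noisy for rare variants
        if lead.std() == 0 or genotypes[:, j].std() == 0:
            continue  # correlation undefined for monomorphic variants
        r = np.corrcoef(lead, genotypes[:, j])[0, 1]
        if r ** 2 >= r2_min:
            hits.append(j)
    return hits
```

In practice the genotype matrix would be restricted to the 1000 Genomes superpopulation matched to the ancestry reported for the GWAS, which is why the thresholds are exposed as parameters rather than fixed.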

GWAS Catalog Experimental Factor Ontology (EFO) terms and disease terms curated from the literature were mapped to the 2020 MeSH thesaurus vocabulary using the approach outlined previously [26]. To allow for inexact matches in MeSH terms (e.g., hypertension and systolic blood pressure), we used two similarity metrics: Lin-Resnik average similarity with a cutoff value of 0.75 [26, 27] and odds ratio of MeSH term co-occurrence in the same PubMed article with a cutoff of 20 [23]. We counted a match between an article identified in our systematic review and a GWAS study if any GWAS Catalog association satisfied the following criteria: (1) the reported variant in the GWAS Catalog has LD r² ≥ 0.5 to at least one curated variant, and (2) the reported trait in the GWAS Catalog has similarity to a main or manually curated disease from the PubMed abstract, meeting or exceeding the cutoff value. We excluded 347 SNPs in 311 articles from the analysis because they were not linked to a GWAS Catalog SNP. A further 292 SNPs contained within 278 articles were excluded due to a poor match between the reported GWAS trait and the trait reported in the abstract (Fig. 1G). The final curated catalog includes 286 articles (Fig. 1H) [28–313].
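The two-part matching rule can be written as a short predicate. The Python sketch below is illustrative only: the `CandidateLink` class and its field names are hypothetical, and it assumes that meeting either of the two trait-similarity cutoffs is sufficient, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import Iterable

LD_R2_MIN = 0.5             # LD cutoff between GWAS Catalog and curated variant
LIN_RESNIK_MIN = 0.75       # Lin-Resnik average similarity cutoff
COOCCURRENCE_OR_MIN = 20.0  # MeSH co-occurrence odds ratio cutoff


@dataclass
class CandidateLink:
    """One pairing of a curated variant-disease record with a GWAS Catalog association."""
    ld_r2: float            # LD between the GWAS Catalog variant and the curated variant
    lin_resnik: float       # trait similarity (Lin-Resnik average)
    cooccurrence_or: float  # trait similarity (MeSH co-occurrence odds ratio)


def is_match(link: CandidateLink) -> bool:
    ld_ok = link.ld_r2 >= LD_R2_MIN
    # Assumption: either similarity metric reaching its cutoff counts as a trait match.
    trait_ok = (link.lin_resnik >= LIN_RESNIK_MIN
                or link.cooccurrence_or >= COOCCURRENCE_OR_MIN)
    return ld_ok and trait_ok


def article_matches(links: Iterable[CandidateLink]) -> bool:
    """An article is retained if any of its candidate links satisfies both criteria."""
    return any(is_match(link) for link in links)
```

Under this reading, an article whose curated variant is in strong LD with a GWAS Catalog variant is still excluded when neither similarity metric reaches its cutoff, which corresponds to the poor trait match exclusions described above.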

Results

Curated catalog of 309 validated GWAS non-coding variants

Several prior studies have emphasized the importance of experimental validations to uncover the biological processes underlying the statistical GWAS associations [3, 6, 7, 314, 315]. The final list of 286 articles reports 309 experimentally validated functional non-coding variants regulating 252 genes across 130 human diseases (Additional file 3 and Fig. 2). Additional file 3 includes several important details about the included articles and variants, including PubMed identifiers (PMID), variant rsID, location, class, target gene as well as disease associations and experimental validation approaches. We examined several characteristics of the validated non-coding variants in relation to GWAS Catalog studies and variants. Between 2007 and 2020 there was a steady increase in the number of validation articles over time up to the 286 we report here. In contrast, the total number of published GWAS articles is 4342, compared with 286 validation articles for non-coding variants (Fig. 3A). Next, we evaluated the relationship between disease heritability explained by common SNPs and the ratio of validated variants to the total number of lead GWAS variants. We mapped disease associations for all variants to the higher order disease categories in the MeSH terms tree structure. For heritability estimates, we considered liability scale h² for UK Biobank phenotypes estimated using LD Score Regression [316, 317] that (1) mapped to a MeSH disease and (2) were considered high or medium confidence, and averaged the heritability across higher level MeSH terms to get the average heritability per disease category. Using this approach, we find a statistically significant (p = 0.01; correlation coefficient 0.51) positive relationship between mean heritability and the ratio of validated/lead GWAS variants per disease category (Fig. 3B). Examination of individual validated variants showed that the majority of validated variants are in strong LD with and in close proximity to the GWAS variant (Fig. 3C, D). Allele frequencies of validated variants have a slightly skewed distribution, with fewer validated variants having lower allele frequencies (Fig. 3E). Comparing the location of experimentally validated non-coding GWAS variants to GWAS lead variants, we found that validated variants are about equally likely to be located within a protein-coding gene (58% for functional variants versus 55% for GWAS lead variants). However, they are much more likely to be within 10 kb of a gene boundary (20% versus 11%) and much less likely to be more than 100 kb from the nearest gene (7% versus 16%) (Fig. 3F). Overall, these findings quantify the persistent need for more experimental validation studies to bridge the gap between association and biology. These findings also suggest that focusing experimental validation efforts on variants in close proximity to and in strong LD with the lead GWAS variant would lead to the identification of a causal variant in the majority of genetic loci.

Fig. 2

Map of 309 validated GWAS non-coding variants. The Circos plot displays the 309 experimentally validated variants studied within the 286 included articles. From the outermost layer inward, the plot shows (i) the validated variants’ 252 target genes, (ii) the chromosomal map, (iii) the location of validated variants marked by their rsIDs, (iv) inner links, based on higher order ontology mapping, between variants associated with diseases in the same category, and (v) the manually annotated validated variant class. Disease systems that contain ten or more validated variants are displayed, while those that contain fewer than ten validated variants are grouped in the “Others” category. Additional file 3 contains all variant details and annotations

Fig. 3

Functional validation remains the bottleneck of GWAS follow-up. A Comparison of the number of published studies in the GWAS Catalog and non-coding variant validation studies over time. B Relationship between the ratio of validated non-coding variants to the total GWAS variants and disease category mean heritability. C Linkage disequilibrium between reported variant in GWAS Catalog and validated variants. D Distance between validated variant and GWAS Catalog-reported variant. E Global minor allele frequency (MAF) of validated variants in 1000 Genomes Phase 3. F Location of experimentally validated non-coding GWAS variants in relation to all protein-coding genes compared to GWAS lead variants

Validated variants regulate 252 target genes through a variety of mechanisms

Non-coding genetic variants can exert their effect on target genes through a variety of mechanisms [318,319,320]. We divided variants into three broad categories based on their mechanism of regulation: cis-regulatory element (CRE) variants, promoter variants and variants acting through non-coding RNAs (Fig. 4A). Promoter variants were grouped separately from other CREs because they are functionally distinct and because the methods used for their validation differ from those used for other CREs. Below we highlight several exemplar studies validating variants across all these mechanisms and many diseases. Interestingly, the majority of non-coding variants identified in our catalog regulate genes through CREs (n = 215). These include variants in enhancers such as rs4420550-MAPK3-TAOK2 in schizophrenia [168], rs11236797-LRRC32 in inflammatory bowel disease [40], and rs9349379-EDN1 in vascular diseases [49]. Some variants exerted their effect through silencers such as rs12038474-CDC42 in endometriosis [130], rs2494737-AKT1 in endometrial carcinoma [37] and rs9508032-FLT1 in acute respiratory distress syndrome [267]. Additionally, rs12936231-GSDMB-ORMDL3-ZPBP2 seems to function through an insulator in an asthma and autoimmune disease risk locus [71].

Fig. 4

Non-coding variants regulate 252 target genes through diverse mechanisms. A Illustration of some of the diverse mechanisms of regulation within each variant category. Examples of each mechanism from included studies are discussed in the text. B Cumulative number of validated variants grouped by non-coding variant categories over time. C We used Ensembl’s BioMart and hg38 to calculate the distance (in kb) between validated variants and their target gene’s closest transcription start site (TSS). The graph plots the number of variant-gene pairs grouped by variant class. Variants more than 200 kb away are plotted at 200 kb. D Distribution of CRE variants relative to their target gene. CRE = Cis-Regulatory Element, ncRNA = non-coding RNA

Variants in gene promoters can alter transcription factor binding and promoter activity. Examples include rs1887428-JAK2 in inflammatory bowel disease [256], rs11789015-BARX1 in esophageal adenocarcinoma [88], rs4065275-ORMDL3 and rs8076131-ORMDL3 in asthma [248], and rs11603334-ARAP1 in type 2 diabetes mellitus [34]. DNA methylation is an important epigenetic mechanism of gene regulation and increased DNA methylation at gene promoters can repress gene transcription [321, 322]. We identified several validated variants that appear to alter promoter methylation including rs780093-NRBP1 in gout [127], rs143383-GDF5 in osteoarthritis [119], and rs35705950-MUC5B in idiopathic pulmonary fibrosis [258]. Alternatively, variants could alter promoter and transcription start site usage. Examples of these mechanisms in our catalog include rs922483-BLK in systemic lupus erythematosus [302] and rs10465885-GJA5 in atrial fibrillation [32].

The third broad category by which variants from our catalog exert their regulatory effect is through non-coding RNAs [323]. microRNAs are a major and well-studied class of regulatory small non-coding RNAs. Variants in microRNAs are known to impact disease biology through post-transcriptional regulation of their target genes, primarily via 3’ untranslated region (UTR) binding [324,325,326]. GWAS variants located within microRNAs can alter their biogenesis, expression levels and/or target specificity, while variants located in target genes are capable of altering microRNA binding sites [326]. Examples of validated variants within microRNAs included in this catalog are miR-196a2 variant rs11614913 regulating SFMBT1 and HOXC8 in metabolic syndrome [277], and miR-4513 variant rs2168518 regulating GOSR2 in cardiometabolic diseases [51]. Given that microRNAs typically target hundreds to thousands of genes, it is very difficult to confidently assign target genes that are mediating the effect of a microRNA variant. On the other hand, studying variants located within microRNA-binding sites of target genes may yield more success in assigning underlying mechanisms [326, 327]. There are numerous examples of such variants reported in this catalog, such as rs5068 altering regulation of NPPA by miR-425 in hypertension [96], rs1058205 altering regulation of KLK3 by miR-3162-5p and rs1010 altering regulation of VAMP8 by miR-370 in prostate cancer [54], and rs372883 altering BACH1 regulation by miR-1257 in pancreatic ductal adenocarcinoma [174]. Another important class of non-coding RNAs is long non-coding RNAs, which are recognized to play an important role in biology and disease [328, 329]. Some examples of long non-coding RNA variants in this catalog include rs6983267 in CCAT2 regulating cancer metabolism through allele-specific binding of CPSF7 [76] and rs2147578 in LAMC2-1 modulating microRNA binding to it in colorectal cancer [43].
We examined the distribution of these three broad categories of validated variants across publication dates. We observed a steady increase in the validation of promoter variants (n = 70) and variants acting through non-coding RNAs (n = 24) since 2007, but a sharp increase in the number of studies validating CRE variants around 2015. This trend persisted through 2020 to reach a total of 215 variants representing 70% of this catalog (Fig. 4B). We also characterized the distance between each validated variant and its target gene’s closest transcription start site according to variant category. As expected, promoter variants clustered immediately upstream or downstream of their target’s transcription start site. CRE variants were more widely distributed, but nevertheless, 157 (66%) of these fell within 50 kb of their target gene TSS. A notable example of an enhancer variant acting at a distance of more than 50 kb is the obesity FTO locus variant rs1421085 regulating IRX3 and IRX5, which are 500 kb and 1,163 kb away, respectively [147]. Since the majority of variants acting through non-coding RNAs identified in our catalog were located within 3’ UTRs, this group of variants tended to cluster within 100 kb downstream of gene transcript start sites (Fig. 4C). The dataset gave us the opportunity to examine the relationship between CRE variants and their target genes (n = 235 CRE variant-target gene pairs). Plotting the distribution of CRE variants based on their location relative to the target gene indicated that 41% of CRE variants are located within their target gene, and an additional 30% are intergenic with the target gene being the closest gene to the variant. A further 14% of CRE variants were intergenic with a target gene that is not the closest gene, and the remaining 15% are located within a gene other than their target gene (Fig. 4D). These results support the consideration of the same gene and nearby genes as candidate targets for CREs. These findings are also in agreement with recent empirical data [330, 331].
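The four location categories used for CRE variant-target gene pairs can be expressed as a small classification routine. The Python sketch below is a hedged illustration, not the annotation code used for this catalog: the `Gene` class, the function names and the tie-breaking choice (a variant inside its target gene is labeled as such even if it also overlaps another gene) are assumptions, and gene coordinates would in practice come from a genome annotation on a single chromosome.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Gene:
    name: str
    start: int  # gene start coordinate (bp)
    end: int    # gene end coordinate (bp), start <= end


def distance_to_gene(pos: int, gene: Gene) -> int:
    """Distance in bp from a variant to the nearest gene boundary (0 if inside)."""
    if gene.start <= pos <= gene.end:
        return 0
    return min(abs(pos - gene.start), abs(pos - gene.end))


def classify_cre_variant(pos: int, target: Gene, genes: List[Gene]) -> str:
    """Classify a variant relative to its target gene.

    genes: all genes on the same chromosome, including the target.
    """
    if distance_to_gene(pos, target) == 0:
        return "within target gene"
    if any(distance_to_gene(pos, g) == 0 for g in genes):
        return "within a different gene"
    closest = min(genes, key=lambda g: distance_to_gene(pos, g))
    if closest.name == target.name:
        return "intergenic, target is closest gene"
    return "intergenic, target is not closest gene"
```

Applied to every variant-target pair, tallying the returned labels would give the kind of breakdown summarized in Fig. 4D.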

Next, using text mining, we extracted and analyzed the experimental methods that were used in each study to validate variants. We classified them into six broad categories covering different types of established validation techniques and related terms: (1) gene expression, including eQTL and molecular assessment of target gene expression and allele-specific regulation (n = 272 articles), (2) reporter assays, including luciferase and massively parallel reporter assays (n = 171 articles), (3) transcription factor binding, including chromatin immunoprecipitation and electrophoretic mobility shift assays (n = 175 articles), (4) in vivo or animal models (n = 104 articles), (5) genome editing, including CRISPR and TALEN (n = 96 articles), and (6) chromatin interaction, including chromosome conformation capture (n = 33 articles) [11]. We examined the number of these approaches that were utilized by the included studies and found that 189 (66%) of all articles utilized three or more approaches (Fig. 5). These results demonstrate the multifaceted approach needed for validation of non-coding variants [11].
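The tallies behind this kind of analysis (per-category totals, counts per exact combination of categories, and the number of articles using three or more approaches) can be sketched in a few lines of Python. The function name and the input format, a mapping from PMID to the set of categories detected for that article, are hypothetical, and the sketch does not reproduce the text-mining step that assigns categories.

```python
from collections import Counter
from typing import Dict, Set, Tuple

CATEGORIES = (
    "gene expression",
    "reporter assays",
    "transcription factor binding",
    "in vivo or animal models",
    "genome editing",
    "chromatin interaction",
)


def summarize_methods(article_methods: Dict[str, Set[str]]) -> Tuple[Counter, Counter, int]:
    """Summarize validation categories per article.

    Returns (combination counts, per-category totals, number of articles
    that used three or more categories).
    """
    # Exact combinations of categories, as shown by the intersections of an UpSet plot.
    combos = Counter(frozenset(m) for m in article_methods.values())
    # Per-category totals, as shown by the set size bars.
    set_sizes = Counter(c for m in article_methods.values() for c in m)
    three_plus = sum(1 for m in article_methods.values() if len(m) >= 3)
    return combos, set_sizes, three_plus
```

With real data, dividing the third return value by the number of articles would give the proportion of studies that combined three or more validation approaches.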

Fig. 5

Studies utilize multiple avenues in validating non-coding variants. Using text-mining of abstracts and metadata, we examined the utilization of different avenues for non-coding variant validation across 286 included articles. The six broad categories were gene expression, reporter assays, transcription factor binding, in vivo or animal models, genome editing, and chromatin interaction. The UpSet plot shows the overlap of the variant validation avenues and the number of articles. The intersection size denotes the number of articles that have the combination of validation categories below it. The color denotes the number of avenues used: pink, 6; orange, 5; green, 4; black, 3; blue, 2; red, 1. The set size bars on the right reflect the total number of studies that employed each of the categories

Discussion

GWAS have seen remarkable growth in the past decade. The impact of GWAS on human healthcare is severely limited by the bottleneck of experimental validation of disease-associated variants. Here, we report the first systematic approach to curate all experimental validation studies of non-coding GWAS variants. While there is general recognition that experimental validation of GWAS findings is seriously lacking [7], in this systematic assessment (1) the number of published experimentally validated non-coding variants is quantified, (2) the variants are cataloged, and (3) the methods used in the identified studies are analyzed.

Using a comprehensive approach, we employed natural language processing-based text mining, manual curation and GWAS Catalog cross-validation. We have curated 286 validation studies that include 309 putatively validated variants regulating 252 genes across 130 diseases. We then evaluated several important characteristics of the identified variants and their relation to GWAS lead variants. The ratio of validated non-coding variants to total GWAS lead variants showed a positive correlation with the mean heritability of disease groups. This relationship could indicate greater success in validating variants in diseases with higher heritability, perhaps because of the greater individual contribution of these variants to overall disease susceptibility. It could also reflect greater interest among scientists in pursuing validation of variants in more heritable diseases and with larger effect sizes, thus leading to a greater proportion of variants being validated. However, we do not have enough data to directly address this possibility. We also evaluated the relationship in LD and distance between validated variants and GWAS lead variants. We find that ~ 70% of validated variants fall within 10 kb of the lead GWAS variant and have r² ≥ 0.9 with it. On one hand, this could reflect underlying genetics, in that most validated variants are in strong LD with lead GWAS variants, and suggests that research would be more productive if focused on SNPs in high LD with and close to lead GWAS variants. On the other hand, the status quo might reflect prior limits on the search space considered by the scientists who performed the validation studies; however, we do not have data to support this possibility [8].

Next, we annotated variants into broad classes based on the mechanisms by which these non-coding variants acted. This identified several interesting patterns, such as an increase in the number of variants functioning through cis-regulatory elements over time. One explanation for this increase could be the growing awareness of the importance of these regulatory elements in human biology and disease which has led to the initiation of large projects aimed at identification, annotation and prioritization of non-coding regulatory elements [10, 320, 332]. Additionally, several SNP-enrichment analyses have demonstrated that GWAS variants are significantly enriched in active regulatory regions [314]. We expect this trend to continue with publications by larger consortia and projects that investigate regulatory elements in different life stages, tissues and biological conditions [332]. Interestingly, the majority of cis-regulatory element variants that we found appeared to act through transcriptional enhancers. This dominance of enhancer variants over other regulatory elements might be a result of enhancer elements having more clearly defined functions and biochemical markers (i.e., histone modification signatures) [333, 334]. This highlights the potential for increased discovery of GWAS variants acting through silencers and insulators as our understanding of their distinct biochemical signatures is refined and assayed in disease relevant cell types [333, 335].

Our comprehensive search and filter strategy enabled us to identify validated variants across a large number of complex human diseases and those that act through a myriad of mechanisms. Nevertheless, the systematic search was limited to the MEDLINE database. Relevant articles published in journals not indexed in this standard database for biomedical literature will be missing from our data set [336, 337]. For quality control and to identify limitations of our search and filter approach, we analyzed the recall of our index studies throughout the entire process (Fig. 1A–H). It is important to highlight that broadening the initial search to include non-coding contexts and association/locus instead of limiting it to explicit mentions of non-coding and GWAS terms ensured identification of relevant studies that we would otherwise have missed. A significant number of index articles did not explicitly mention these terms [48, 78, 134, 143, 147, 171, 178, 210, 230, 256, 302]. Our final broad search covered 27 out of the 28 index studies, which demonstrates good search coverage. Through an iterative process, we narrowed down these results, trying to maximize the recall of index studies while maintaining a manageable number of articles for manual review. We are aware that the implemented stringent criteria bias the search to exclude true validation articles that did not mention any disease, protein or specific experimental validation terms [338,339,340,341,342,343,344,345]. Additionally, the tagging of the articles and normalization of concepts for filtering relies on accurate named entity recognition (NER) and ontologies. Even when using highly curated, enriched vocabularies and state-of-the-art NER routines, recall rates of at most 80–95% are assumed (depending on entity type). Overall, a total of 19 index studies passed all filtering stages and were included in the final catalog. Finally, the data of our curated catalog is mainly based on the publications’ abstract information. Only in cases where information was missing or unclear in the abstract did we gather data from the full text. Therefore, it is possible that information gathered from the final set of articles may be incomplete. This would have affected the experimental validation techniques analysis in particular, which was based only on abstract mining.

Construction of the catalog using controlled vocabularies for diseases, variants, genes, variant classes, and functional follow-up methods is intended to facilitate use in bioinformatics follow-up analyses. We expect this resource to be useful in evaluating the performance of computational fine-mapping and target prioritization methods. Quantifying the performance of these methods on real datasets has previously been hindered by a lack of true positive examples. A large dataset of true positive examples would allow researchers to computationally identify features associated with functional variation. Recent efforts to compile such true positive datasets and use them to train target prioritization methods have come with concerns about bias towards coding variation [16] or are aimed at a specific trait subset such as molecular phenotypes [346] or immune disease [347]. We expect this catalog to contribute a large number of much needed examples of functional non-coding variants in human disease and the genes on which they act. Despite this important contribution, bias towards genes and variants near the top GWAS SNP is still a concern for our catalog due to the limited number of variants and genes evaluated in the cataloged studies. To generate an unbiased training set for computational methods, an ideal functional study following up on a GWAS association would consider all credible causal SNPs and their nearby genes, but studies in our catalog typically consider a more limited set of genes and SNPs. For example, eQTL variants may be shared among multiple transcripts [348], and in this scenario functional studies considering only a single gene could be misleading about the causal gene.

Conclusions

This review is the first to systematically evaluate the status and the landscape of experimentation being used to validate non-coding GWAS-identified variants. Our results clearly underscore the multifaceted approach needed for experimental validation. The findings on the relationship of validated variants to lead GWAS variants and to their target genes provide practical insights for future validation studies. Finally, we aim for the catalog to be a useful resource aiding in the development of prediction tools by providing a truth set of experimentally validated variants. Collectively, this contributes to the overall effort to bridge the gap between genetic association and function in complex diseases.